Fei-Fei Li & Justin Johnson (World La...

1. Key Themes

Spatial Intelligence as the Next Frontier Beyond Language Models

World Labs is built on the thesis that language models represent only one dimension of intelligence, and that spatial intelligence — the ability to reason, understand, move, and interact in 3D space — is the next major frontier. Fei-Fei Li frames this not as a replacement but as a complement to linguistic AI.

"A lot of AI as a field, as a discipline, is inspired by human intelligence... there is a psychologist, I think his name is Howard Gardner, in the 1960s, actually literally called multiple intelligence to describe human intelligence. And there is linguistic intelligence, there's spatial intelligence, there is logical intelligence and emotional intelligence. So for me, when I think about spatial intelligence, I see it as complementary to language intelligence." — Fei-Fei Li [00:42:44]

Marble: A Generative Model of 3D Worlds With Real Commercial Utility Today

World Labs' first product, Marble, generates 3D worlds from multimodal inputs (text, image, multiple images) and enables interactive editing. The founders deliberately designed it to be both a research milestone and an immediately useful product, avoiding the trap of a pure science project.

"While Marble is simultaneously a world model that is building towards this vision of spatial intelligence, it was also very intentionally designed to be a thing that people could find useful today. And we're starting to see emerging use cases in gaming, in VFX, in film, where I think there's a lot of really interesting stuff that Marble can do today as a product." — Justin Johnson [00:32:21]

Gaussian Splats as the Native Atomic Unit of 3D Generation

Marble natively outputs Gaussian splats — tiny semi-transparent particles with position and orientation in 3D space — enabling real-time rendering on mobile and VR devices. This is a foundational architectural choice with significant downstream implications for interactivity and physics simulation.
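
As a rough illustration of what a splat-based scene representation involves, the sketch below models one splat as a record and blends the splats that cover a single pixel using front-to-back alpha compositing. This is not Marble's format or renderer: the field names, the `composite` helper, the fixed +z viewing axis, and the omission of Gaussian falloff and screen-space projection are all simplifying assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class Splat:
    """One semi-transparent particle (hypothetical field layout)."""
    position: Vec3                               # center in 3D space
    rotation: Tuple[float, float, float, float]  # orientation as a unit quaternion (w, x, y, z)
    scale: Vec3                                  # assumed: per-axis extent of the Gaussian
    color: Vec3                                  # assumed: RGB in [0, 1]
    opacity: float                               # semi-transparency in [0, 1]


def composite(splats: List[Splat]) -> Vec3:
    """Blend splats covering one pixel, nearest first.

    Assumes the camera looks along +z and every splat fully covers the pixel.
    """
    ordered = sorted(splats, key=lambda s: s.position[2])
    r = g = b = 0.0
    transmittance = 1.0  # fraction of light not yet absorbed by nearer splats
    for s in ordered:
        weight = transmittance * s.opacity
        r += weight * s.color[0]
        g += weight * s.color[1]
        b += weight * s.color[2]
        transmittance *= 1.0 - s.opacity
    return (r, g, b)


identity = (1.0, 0.0, 0.0, 0.0)
unit = (1.0, 1.0, 1.0)
near_red = Splat((0.0, 0.0, 1.0), identity, unit, (1.0, 0.0, 0.0), 0.5)
far_blue = Splat((0.0, 0.0, 2.0), identity, unit, (0.0, 0.0, 1.0), 1.0)

pixel = composite([far_blue, near_red])
print(pixel)  # (0.5, 0.0, 0.5)
```

The half-transparent red splat in front lets half of the opaque blue splat behind it show through. Because each splat contributes independently once sorted by depth, this kind of blend is cheap to evaluate, which is consistent with the real-time rendering claim.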

"The model natively output splats. So Gaussian splats are these like, you know, each one is a tiny, tiny particle that's semi-transparent, has a position orientation in 3D space. And the scene is built up from a large number of these Gaussian splats. And Gaussian splats are really cool because you can render them in real time really efficiently. So you can render on your iPhone, render everything." — Justin Johnson [00:34:12]

The Deep Learning Scaling Thesis Applied to Spatial/Visual Data

The founders argue that just as language benefited enormously from scaling compute, visual and spatial data will absorb the next wave of compute coming online — and that we are now at a million-fold increase in compute capacity versus early AlexNet days.
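
The million-fold figure is the product of two factors cited in the episode: roughly a thousand times more performance per card, and training runs spread across many GPUs instead of one. A minimal sketch of that arithmetic follows; the function name is hypothetical, the cluster size of 1,000 is an assumed point within the range mentioned, and perfect linear scaling across GPUs is assumed even though real clusters fall short of it.

```python
def total_compute_gain(per_card_gain: float, num_gpus: int) -> float:
    """Compute available to one training run, relative to a single early-era GPU.

    Assumes perfect linear scaling across GPUs (an idealization).
    """
    return per_card_gain * num_gpus


# ~1,000x per card (figure from the episode) times an assumed 1,000-GPU run.
gain = total_compute_gain(per_card_gain=1_000, num_gpus=1_000)
print(f"{gain:,.0f}x")  # 1,000,000x
```

With tens of thousands of GPUs the same product would exceed a million-fold, so the quoted figure reads as an order-of-magnitude estimate.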

"Even from AlexNet to today, we're getting about a thousand times more performance per card than we had in AlexNet days. And now it's common to train models not just on one GPU, but on hundreds or thousands or tens of thousands or even more. So the amount of compute that we can marshal today on a single model is, you know, about a million fold more than we could have even at the start of my PhD." — Justin Johnson [00:04:12]

The Transformer is a Set Model, Not a Sequence Model — A Critical Architectural Insight

Justin Johnson makes a non-obvious but important point: transformers are natively models of sets, not sequences, and their sequential behavior is purely an artifact of positional embeddings. This has profound implications for applying transformers to 3D/spatial data.
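
The set-versus-sequence claim can be checked numerically. The sketch below implements single-head, unmasked self-attention in plain Python with no positional embeddings, then shows that shuffling the input tokens only shuffles the outputs in the same way. The helper names, matrix sizes, and random weights are assumptions chosen for illustration, not anything taken from World Labs' models.

```python
import math
import random
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (n x k) and b (k x m)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row: List[float]) -> List[float]:
    """Numerically stable softmax over one row of attention scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens: Matrix, wq: Matrix, wk: Matrix, wv: Matrix) -> Matrix:
    """Single-head scaled dot-product attention; no mask, no positional embedding."""
    q, k, v = matmul(tokens, wq), matmul(tokens, wk), matmul(tokens, wv)
    scale = math.sqrt(len(q[0]))
    scores = [[sum(a * b for a, b in zip(qi, kj)) / scale for kj in k] for qi in q]
    weights = [softmax(r) for r in scores]
    return matmul(weights, v)


def random_matrix(rows: int, cols: int, rng: random.Random) -> Matrix:
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
tokens = random_matrix(4, 3, rng)  # 4 tokens, embedding width 3
wq, wk, wv = (random_matrix(3, 3, rng) for _ in range(3))

perm = [2, 0, 3, 1]  # an arbitrary reordering of the 4 tokens
out = self_attention(tokens, wq, wk, wv)
out_perm = self_attention([tokens[i] for i in perm], wq, wk, wv)

# Row i of the permuted run should match row perm[i] of the original run.
max_diff = max(
    abs(a - b)
    for i, p in enumerate(perm)
    for a, b in zip(out_perm[i], out[p])
)
print(max_diff < 1e-9)  # True
```

Each token's output depends only on its own content and the unordered collection of other tokens, so order has to be supplied from outside, for example through positional embeddings. For 3D data, where elements have no single natural ordering, that property is what makes the architecture a plausible fit.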

"A transformer is actually not a model of a sequence of tokens. A transformer is actually a model of a set of tokens, right? The only thing that gives, that injects the order into it, in the standard transformer architecture, the only thing that differentiates the order of the things is the positional embedding that you give the tokens... all the operators that happen inside a transformer block are either token-wise... And then you have interactions between tokens through the attention mechanism. But that's also sort of... It's permutation equivariant." — Justin Johnson [00:56:44]

Synthetic Data via World Models as a Critical Unlock for Robotics

A key but understated use case for Marble is generating synthetic simulated environments for training embodied/robotic agents. Fei-Fei Li argues this is a critical middle ground between scarce real-world data and uncontrollable internet video.

"Robotic training really lacks data. You know, high fidelity real world data is absolutely very critical, but you're just not going to get a ton of that... simulation and synthetic data is actually a very important middle ground for that. I've been working in this space for many years and one of the biggest pain point is where do you get the synthetic simulated data? You have to curate assets and build these, compose these complex situations. And marble actually is a really potential for helping to generate these synthetic simulated worlds for embodied agent training." — Fei-Fei Li [00:38:57]

Academia Is Severely Under-Resourced, Not Irrelevant

Fei-Fei Li distinguishes between academia losing relevance (which she disputes) and academia being starved of compute (which she argues is the real problem). She has been lobbying for policy solutions, including a National AI Research Resource (NAIRR).

"I think the problem right now is that academia by itself is severely under-resourced so that, you know, the researchers and the students do not have enough resources to try these ideas." — Fei-Fei Li [00:11:14]

Physics Engines Will Not Be Replaced But Will Be Augmented by Neural World Models

The founders take a nuanced position: classical physics engines are imperfect and don't generalize, but that doesn't mean they should be discarded. The opportunity is in augmenting or hybridizing them with learned models, including attaching physical properties directly to Gaussian splats.

"In some sense, the reason that you want to build these things at all is because maybe traditional physics engines don't work in some situations. If a physics engine was perfect, we would have sort of no need to build models because the problem would have already been solved. So in some sense, the reason why we want to do this is because classical physics engines don't solve problems in the generality that we want. But that doesn't mean we need to throw them away and start everything from scratch." — Justin Johnson [00:28:32]

2. Contrarian Perspectives

Spatial Intelligence Has Been Undervalued Precisely Because It Is Effortless for Humans

Most AI discourse focuses on language and reasoning as the pinnacle of intelligence. Fei-Fei Li inverts this: nature spent 540 million years optimizing spatial perception versus at most half a million years on language. The effortlessness of human vision has caused the field to systematically underestimate its difficulty and importance.

"I always find that vision is underappreciated because it's effortless for humans. You open your eyes as a baby, you start to see your world. We're somehow born with it... whereas something that nature spends way more time actually optimizing, which is perception and spatial intelligence, is underappreciated by humans... it took 540 million years to optimize perception and spatial intelligence and language. The most generous estimation of language development is probably half a million years." — Fei-Fei Li [00:47:53]

LLMs Have Skipped Directly to Abstracted Reasoning Without Grounding — and That's a Fundamental Gap

Most people view LLMs as increasingly general intelligence. Justin Johnson argues they have leapfrogged the foundational embodied, spatial layer of cognition entirely, and that this represents a genuine gap rather than an engineering detail to be patched.

"LLMs have just like jumped all the way to those highest forms of abstracted reasoning, which is very interesting and very useful. But spatial intelligence is almost like opening up that black box again and saying maybe we've lost something by going straight to that fully abstracted form of language and reasoning and communication." — Justin Johnson [00:46:56]

The GPU/NVIDIA Hardware Paradigm Is Already Hitting Its Scaling Wall

Against the conventional wisdom that NVIDIA's dominance will persist indefinitely and that software will absorb any hardware gaps, Justin Johnson points out that performance per watt is already plateauing from Hopper to Blackwell — suggesting a genuine opening for new computing paradigms.

"Even going from Hopper to Blackwell, like the performance per watt is about the same. Yes. They mostly make the number of transistors go up and they make the chip size go up and they make the power usage go up. But even from Hopper to Blackwell, we're kind of already seeing like a scaling limit in terms of what is the performance per watt that we can get." — Justin Johnson [00:12:45]

Large Language Models Will Predict Accurate Trajectories but Will Never Derive F = ma

Most AI optimists assume emergent capabilities will eventually yield physical laws. Fei-Fei Li takes the contrarian view that LLMs will achieve accurate empirical predictions of phenomena like orbital mechanics while being categorically incapable of discovering the underlying abstract laws.

"Giving enough celestial movement data, an LLM would actually predict pretty accurate movement trajectories. Let's say I invent a planet surrounding a star and giving enough data, my model would tell you, you know, on day one where it is, day two where it is. I wouldn't be surprised. But F equals MA or, you know, action equals reaction. That's just a whole different abstraction level. That's beyond just today's LLM." — Fei-Fei Li [00:52:28]

Neural Networks Should Be Redesigned Around Future Hardware Primitives, Not Current Ones

While the field is almost entirely focused on optimizing for GPU-based matrix multiplication, Justin Johnson argues that as compute scales from single devices to massive distributed clusters, the architectural primitives underlying neural networks may need to be fundamentally reconceived.

"Just as transformers are based around matrix multiplication and matrix multiplication is sort of the primitive that works really well on GPUs — as you imagine hardware scaling out, are there other primitives that make more sense for large scale distributed systems that we could build our neural networks on? And I think it's possible that there could be drastically different architectures that fit with the next generation or like the hardware that's going to come 10 or 20 years down the line." — Justin Johnson [00:12:03]

3. Companies Identified

World Labs

Spatial intelligence company, co-founded by Fei-Fei Li and Justin Johnson, that builds world models. Its first product, Marble, generates interactive 3D worlds from text and image inputs and outputs them as Gaussian splats. Current and emerging use cases include gaming, VFX, film, interior design, architecture, robotic simulation, and embodied AI training. Why mentioned: the primary subject of the episode; the founders describe the technical architecture, commercial strategy, and long-term vision in detail.

"We are a model of spatial intelligence model company. We believe spatial intelligence is the next frontier... Marble is the first in-class model in the world that generates 3D worlds in this level of fidelity that is in the hands of the public." — Fei-Fei Li [00:30:25]

Google (DeepMind / Google Brain)

Developed image captioning research in 2014-2015 at the same time as, and independently of, Fei-Fei Li and Andrej Karpathy's Stanford group. Also referenced for DeepSeek-style pixel-based language modeling experiments. Why mentioned: demonstrates the competitive frontier of multimodal AI research and the pace of parallel discovery.

"We thought we were the first people doing it. It turned out that Google at that time was also simultaneously doing it." — Fei-Fei Li [00:15:28]

Stanford HAI (Institute for Human-Centered AI)

Stanford's interdisciplinary AI institute co-founded/co-directed by Fei-Fei Li, advocating for public sector and academic AI resourcing. Why mentioned: vehicle through which Fei-Fei has been lobbying policymakers for the National AI Research Resource (NAIRR).

"I've been, you know, working with policymakers about resourcing public sector and academic AI work, right? We work with the first Trump administration on this bill called National AI Research Resource, NAYER bill, which is scoping out a national AI compute cloud as well as data repository." — Fei-Fei Li [00:08:00]

NVIDIA

Referenced as the dominant GPU hardware provider, with Hopper and Blackwell architectures specifically called out as evidence that performance-per-watt scaling is plateauing. Why mentioned: central to the contrarian hardware scaling argument.

"Even going from Hopper to Blackwell, like the performance per watt is about the same." — Justin Johnson [00:12:45]

4. People Identified

Fei-Fei Li

Co-founder and CEO of World Labs; Professor of Computer Science at Stanford; co-director of Stanford HAI; creator of ImageNet. One of the most consequential figures in modern AI, having catalyzed the deep learning revolution through ImageNet. Why mentioned: co-founder describing World Labs' vision and her intellectual lineage.

"When I graduated from grad school, I really thought the rest of my entire career would be towards solving that single problem, which is given a picture or given a scene, tell the story in natural language. But things evolved so fast." — Fei-Fei Li [00:14:33]

Justin Johnson

Co-founder of World Labs; former professor at University of Michigan; former researcher at Meta. PhD student of Fei-Fei Li who joined her lab the same quarter AlexNet was released. Architect of Marble's technical design. Why mentioned: deep technical architect of World Labs; authored the Marble technical blog; brings a distinct hardware and systems perspective.

"After that, seeing that kind of ImageNet era during my PhD, I had the sense that the next sort of decade of computer vision was going to be about getting AI out of the data center and out into the world." — Justin Johnson [00:02:44]

Andrej Karpathy

Former Stanford PhD student of Fei-Fei Li; co-developed the original image captioning work and LSTM language modeling papers with Fei-Fei Li and Justin Johnson in 2014-2015; later co-founded OpenAI and led Tesla AI. Why mentioned: intellectual origin of multimodal AI connecting vision and language — the direct intellectual precursor to world models.

"Andre and I were just talking about this has been a long term dream of mine... maybe combining the representation of convolutional neural network as well as the language sequential model of LSTM... we might be able to learn through training to match caption with images." — Fei-Fei Li [00:14:33]

Yann LeCun

Referenced as perhaps the most prominent and longstanding proponent of world models as an AI paradigm. Why mentioned: intellectual predecessor to the World Labs thesis; validates the long research arc behind the company's direction.

"It's an idea that has been out there, right? There's been, you know, Yann LeCun is maybe like the most, the biggest proponent, most prominent of it." — Swix (host) [00:03:31]

Howard Gardner

Psychologist referenced for his theory of multiple intelligences (linguistic, spatial, logical, emotional), used by Fei-Fei Li to frame spatial intelligence as a distinct and equally important dimension of cognition. Why mentioned: provides the academic framework for World Labs' core thesis.

"There is a psychologist, I think his name is Howard Gardner, in the 1960s, actually literally called multiple intelligence to describe human intelligence." — Fei-Fei Li [00:42:44]

John Markoff

New York Times reporter who broke the simultaneous Google/Stanford image captioning story in 2015 after accidentally learning about Fei-Fei Li and Andrej Karpathy's work while covering Google's research. Why mentioned: illustrates the competitive simultaneity of breakthrough AI discoveries.

"A reporter was John Markov from New York Times was breaking the Google story. But he by accident heard about us. And then he realized that we really independently got there together at the same time." — Fei-Fei Li [00:15:28]

Dario Amodei (named in the episode only as "Dario"; context suggests the Anthropic CEO)

Referenced for the claim of "a data center full of Einsteins" as a metaphor for LLM intelligence. Fei-Fei Li explicitly expresses skepticism about this framing. Why mentioned: foil for the argument that linguistic/reasoning intelligence is not the whole story.

"I don't understand that sentence, a data center full of Einsteins. I just don't understand that." — Fei-Fei Li [00:42:44]

5. Operating Insights

Build Research Products That Serve Both Scientific and Commercial Goals Simultaneously — Don't Let Them Diverge

World Labs explicitly designed Marble to advance their core research vision while also being immediately commercially useful. This dual-purpose approach prevents the "science project" trap and generates real user feedback that informs the model's development. The key discipline is designing the research artifact so that it generates the research signal you need while also delivering tangible value to paying customers.

"We are a company, we're a business. We were really trying not to have this be a science project, but also build a product that would be useful to people in the real world today... we actually tried to do sort of two things simultaneously. And I think we managed to pull off the balance pretty well." — Justin Johnson [00:31:22]

Horizontal Platform Strategy Yields Emergent Vertical Use Cases at No Extra R&D Cost

World Labs did not build anything specific for interior design, yet users are already deploying Marble for kitchen remodeling and architects are using it for space planning. By building a powerful general-purpose technology, specific vertical applications emerge without dedicated investment, which dramatically improves capital efficiency in early product development.

"Because it's a powerful horizontal technology, you kind of get these emergent use cases that just fall out of the model. We have early beta users using an API key that is already building for interior design use case." — Justin Johnson [00:41:46]

Demo as Recruiting and Discovery Tool — Go Beyond the Paper

Justin Johnson's habit of building live real-time demos (streaming model inference from Stanford to a conference in Santiago at 1 FPS) created outsized visibility and demonstrated engineering seriousness beyond academic publishing. Fei-Fei Li explicitly notes this set him apart as a student — and it's a model for how technical founders should think about product credibility before they have a product.

"Most of my graduate students would be satisfied if they can publish the paper, right? They package the research, put it in a paper. But Justin went a step further. He's like, I want to do this real-time web demo." — Fei-Fei Li [00:19:25]

6. Overlooked Insights

The NAIRR (National AI Research Resource) Is a Government-Backed Compute Cloud That Could Reshape Academic AI Power Dynamics

Fei-Fei Li mentions almost in passing that she worked with the first Trump administration on legislation to establish the National AI Research Resource (NAIRR), a national AI compute cloud and data repository. This is not a theoretical proposal; it was actively legislated. If this resource scales, it could rebalance the power dynamic between compute-rich private labs and under-resourced universities, potentially reviving a wave of open, academically driven AI research that resource constraints currently hold back. For investors, this could seed a new generation of academic spinouts with foundational model access they previously lacked.

"We work with the first Trump administration on this bill called National AI Research Resource, NAYER bill, which is scoping out a national AI compute cloud as well as data repository." — Fei-Fei Li [00:08:00]

The Future Architecture of Neural Networks May Be Fundamentally Different — and Nobody Is Building It Yet

Justin Johnson throws out — almost as an aside — that as the unit of compute shifts from individual GPUs to massive distributed clusters, the mathematical primitives underlying neural networks (currently matrix multiplication, optimized for GPUs) may need to be entirely reconceived. He explicitly frames this as multi-year academic work, not a startup play — which means it is almost certainly being neglected. The company or research group that identifies the right primitive for distributed-native neural architectures 10 years from now could be as foundational as the GPU itself was to the current era. This is a directional bet for long-horizon deep tech investors or academic lab funders.

"Are there other primitives that make more sense for large scale distributed systems that we could build our neural networks on? And I think it's possible that there could be drastically different architectures that fit with the next generation or like the hardware that's going to come 10 or 20 years down the line. And we could start imagining that today." — Justin Johnson [00:12:03]