Ankit Gupta | Lightcone Summary

Podcast: Lightcone (Decoded) | Participants: Ankit Gupta, Francois Chaubard

1. Key Themes

The Fundamental Reasoning Ceiling of LLMs Is Architectural, Not Just Scale-Limited

LLMs have a hard theoretical ceiling on certain reasoning tasks because they are feed-forward models with no external memory. The core issue isn't that they need more parameters — it's that some problems are computationally incompressible within a fixed number of transformer layers.

"It's like literally that we know a theoretical lower bound that for comparison sort you can't do better than n log n steps. And if I have a list that's 31 characters or elements long and my transformer is 30, I run out of steps to do comparisons." — Francois Chaubard 00:04:36

"A lot of the view of what these LLMs are doing is finding really amazing embedding representation spaces. But reasoning inside that space is actually not done all that much — it's always through the token space." — Francois Chaubard 00:37:09

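The sorting bound in the first quote can be made concrete with a few lines of arithmetic. This is a minimal sketch, not something shown in the episode: it computes the standard information-theoretic lower bound on worst-case comparisons, ceil(log2(n!)). The function name `min_comparisons` is made up for illustration, and how many comparison-like steps a single transformer layer can perform is left open, since the episode does not specify it.

```python
import math


def min_comparisons(n: int) -> int:
    """Lower bound on worst-case comparisons needed to sort n distinct items.

    There are n! possible orderings and each comparison has two outcomes,
    so any comparison sort needs at least ceil(log2(n!)) comparisons.
    """
    return math.ceil(math.log2(math.factorial(n)))


for n in (4, 8, 31):
    print(f"n={n:>2}  minimum comparisons={min_comparisons(n)}")
```

The bound grows with the length of the list, while the number of layers in a trained model stays fixed. That mismatch is the point of the quote.
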
Recursion at Inference Time Is a New and Viable Scaling Axis

Rather than making models bigger, applying the same weights recursively — with a persistent latent memory state — allows tiny models to outperform massive ones on hard reasoning benchmarks. This is a genuinely different axis of scaling.

"A 7 million parameter model can solve what a 100 billion parameter model can't solve trained on the entire internet, and a 7 million parameter wins." — Francois Chaubard 00:35:53

"It is sufficient, not necessary to go bigger and get better performance, and it is sufficient and not necessary to add more recursion. And so where I'm really excited is what happens if you do both." — Francois Chaubard 00:33:55

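The difference between stacking layers and recursing over shared weights can be sketched in plain Python. This is a toy under stated assumptions, not the HRM or TRM implementation: the tanh update, the dimensions, and every function name here are hypothetical. It only shows that recursion adds sequential steps without adding parameters.

```python
import math
import random


def make_layer(dim, rng):
    """Random dim x dim weight matrix (hypothetical stand-in for a trained block)."""
    scale = 1.0 / math.sqrt(dim)
    return [[rng.gauss(0.0, scale) for _ in range(dim)] for _ in range(dim)]


def apply_layer(weights, x, z):
    """One update of the latent state z given the input x: z <- tanh(W (x + z))."""
    mixed = [xi + zi for xi, zi in zip(x, z)]
    return [math.tanh(sum(w * m for w, m in zip(row, mixed))) for row in weights]


def count_params(layers):
    """Total number of scalar weights across a list of weight matrices."""
    return sum(len(row) for w in layers for row in w)


def stacked_forward(layers, x):
    """Conventional depth: a distinct weight matrix for every step."""
    z = [0.0] * len(x)
    for w in layers:
        z = apply_layer(w, x, z)
    return z


def recursive_forward(weights, x, steps):
    """Recursion: one weight matrix reused at every step, with z carried forward."""
    z = [0.0] * len(x)
    for _ in range(steps):
        z = apply_layer(weights, x, z)
    return z


rng = random.Random(0)
dim, depth = 8, 16
stack = [make_layer(dim, rng) for _ in range(depth)]
shared = make_layer(dim, rng)
x = [rng.uniform(-1, 1) for _ in range(dim)]

print(count_params(stack))     # 16 * 8 * 8 = 1024
print(count_params([shared]))  # 8 * 8 = 64
print(len(recursive_forward(shared, x, depth)))  # latent state keeps its size: 8
```

Both paths perform 16 sequential updates, but the recursive one does so with one sixteenth of the weights in this toy setup.
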
Chain of Thought and Tool Use Are Bounded by Human Knowledge — Recursion Is Not

Chain of thought and tool use are workarounds to the feed-forward limitation, but they cannot produce genuinely novel reasoning beyond the training distribution. Recursive latent-space models can actually discover solutions without being teacher-forced.

"In both cases, both hacks to solve this in COT and tool use, you're bounded by the bounds of human knowledge. In the event it's outside the set of human knowledge, then you're kind of SOL." — Francois Chaubard 00:19:46

"If we had Sudoku and we know how to solve Sudoku, because we were just dumb homo sapiens that didn't know how to solve Sudoku, it would just have solved it. And that's why it's cool because it actually is able to discover things without being teacher forced via chain of thought." — Francois Chaubard 00:27:17

2. Contrarian Perspectives

Bio-Plausibility Is Intellectually Useful But a Trap for Engineering Progress

Most ML advances that cited biological inspiration later discarded those very biological elements for something that runs better on a GPU. Founding a research direction on brain-inspired architecture is a marketing move, not a scientific compass.

"Maybe you need to do it to get accepted into NeurIPS. Yeah, sure." — Francois Chaubard 00:15:14

"I tend to not be bounded by bio-plausibility when I think about what machine learning systems we should prioritize working on or think of as particularly exciting, other than as an interesting scientific launching point for a deeper exploration." — Ankit Gupta 00:16:18

More Test-Time Compute in These Recursive Models Is Largely Wasteful

Counterintuitively, a model trained with 16 outer refinement loops and tested with only 1 retains roughly seven-eighths of its performance, according to Chaubard. The value of recursion is captured mostly during training rather than at test time, the opposite of what most people assume.

"If you actually train on 16 and you test on only one, you get like seven eighths of the performance or like almost all the performance. So it's actually quite interesting that this is just over done too much compute and it doesn't actually help you all that much." — Francois Chaubard 00:30:19

"Train time recursion was important but test time recursion was actually not that important. Which is kind of counter-intuitive." — Francois Chaubard 00:31:05

We Don't Actually Know Why Truncated Backprop Through Time Works — And That's a Warning Sign

The mathematical justification offered (fixed point iteration / deep equilibrium models) has been shown not to hold in practice. The method works empirically, but the theoretical basis has been invalidated, which means the field is building on something it doesn't understand.

"That math holds and it works. It follows DEQ directly in the event that the ZL and the delta ZH go to zero, which it actually doesn't do. And so we actually don't know why it's really working." — Francois Chaubard 00:13:38

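For context, the deep equilibrium (DEQ) argument referred to here can be written out as follows. This is the standard textbook form, stated from general knowledge rather than taken from the episode. The symbols f, z*, x, and θ are notation chosen for this sketch, and the mapping onto the ZL and ZH states mentioned in the quote is not spelled out in the source.

```latex
% Fixed point of the recurrent update f with parameters \theta and input x
z^\ast = f(z^\ast, x; \theta)

% Implicit differentiation of the fixed-point equation
\frac{\partial z^\ast}{\partial \theta}
  = \left( I - \frac{\partial f}{\partial z}\bigg|_{z^\ast} \right)^{-1}
    \frac{\partial f}{\partial \theta}\bigg|_{z^\ast}

% One-step approximation: replace the inverse by the identity
\frac{\partial z^\ast}{\partial \theta}
  \approx \frac{\partial f}{\partial \theta}\bigg|_{z^\ast}

% Both expressions presuppose that the iterates z_{k+1} = f(z_k, x; \theta) converge
\lVert z_{k+1} - z_k \rVert \to 0 \quad \text{as } k \to \infty
```

The last line is the condition Chaubard says fails in practice: if successive states do not stop changing, there is no fixed point at which to evaluate the first three lines.
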
Larger Models Are Not Necessary for State-of-the-Art Reasoning on Hard Tasks

The conventional wisdom is that frontier performance requires frontier-scale models. HRM and TRM are presented as counterexamples for specific hard reasoning domains: a 27M-parameter model trained from scratch on roughly 1,000 ArcPrize tasks reportedly scored about 70% on ArcPrize 1, where OpenAI's o3 is said to score zero.

"This was only a 27 million parameter model that was only trained on ArcPrize. There's literally a thousand tasks. There is no pre-training at all. This starts from literally tabula rasa weights. And it can outperform — o3 gets zero. Literally zero. And this got like something like 70% on ArcPrize 1." — Francois Chaubard 00:10:03

Deeper Is Not Always Better — Even Within Recursive Architectures

The TRM paper shows that using a single transformer layer and recursing more times outperforms using four layers. And on Sudoku specifically, an MLP outperforms attention entirely — suggesting transformer attention may be solving the wrong subproblem in some domains.

"Going deeper actually didn't help. And actually on some tests, it was just the feed forward net that works just as well as a transformer there — on Sudoku, MLP actually outperformed the attention." — Ankit Gupta / Francois Chaubard 00:31:44

3. Companies Identified

Google DeepMind / Gemini

Major AI lab and producer of the Gemini family of models. Chaubard speculates that recursive architectures may already be in Gemini, at least in part. The episode offers no confirmation of this.

"The right answer is to take the amazingness here and take the amazingness here, which probably is already in Gemini already or some of these, it might be at least in some part." — Francois Chaubard 00:35:53

Ndea (François Chollet's Company)

AI research lab co-founded by François Chollet. A researcher named Konstantin, affiliated with Ndea, performed the critical ablation studies on HRM that identified the outer refinement loop as the key innovation, work that the original paper authors did not do themselves.

"This guy, Konstantin at François Chollet's company, Ndea, actually did. And it's this amazing breakdown that he posted on YouTube. The main takeaway is that the outer refinement loop is the main reason why these things work so well." — Francois Chaubard 00:21:21

4. People Identified

Alexia Jolicoeur-Martineau (TRM Paper Author)

Researcher who authored the Tiny Recursive Models (TRM) paper. She simplified HRM, collapsed the dual network into one, reduced parameters from 27M to 7M, improved ArcPrize 1 performance from 70% to 87%, and found that backpropagating through one full recursive loop is sufficient. She also showed that the DEQ mathematical justification does not hold.

"She figures out that you actually can back prop through all the way to the deep recursion, which actually improves performance much, much more." — Francois Chaubard 00:14:06

"She makes the model three, four times smaller. Because it has that recursion it actually outperforms." — Francois Chaubard 00:33:25

Alex Graves

Pioneering researcher at the intersection of RNNs and adaptive compute, known for Neural Turing Machines, Adaptive Computation Time, and Differentiable Neural Computers. Described as the intellectual predecessor to the HRM/TRM line of work, though his approaches were limited by backprop through time.

"We were very much in the belief that this was required to get to AGI — peak RNN use probably until 2016 with Alex Graves' NeurIPS keynote, which is just fantastic, and all his adaptive compute time work." — Francois Chaubard 00:00:53

Konstantin (at Ndea)

Ndea-affiliated researcher who performed the missing ablation studies on the HRM paper and correctly identified the outer refinement loop as the core mechanism. His YouTube breakdown is cited as the clearest analysis of what actually matters in these papers.

"Konstantin does a good job of this... the main takeaway is that the outer refinement loop is the main beneficiary, the main reason why these things work so well." — Francois Chaubard 00:21:21

Melanie Mitchell

Researcher and author who wrote a book capturing a key insight now validated by TRM/HRM: that bigger models and more recursion are each sufficient but not necessary for improved performance — and the real gains come from combining both.

"There's this researcher named Melanie Mitchell that writes this book talking about this very phenomenon which is like, it is sufficient, not necessary to go bigger and get better performance, and it is sufficient and not necessary to add more recursion." — Francois Chaubard 00:33:55

Demis Hassabis

CEO of Google DeepMind. Cited for framing the "Einstein test" as the ultimate AGI benchmark — can a model rediscover all of physics from scratch starting in 1911? This test directly illustrates why chain of thought alone will never reach AGI.

"Demis had this whole thing about like the ultimate test is the Einstein test. Like go back to 1911 and then have it rebuild all the physics up until now." — Francois Chaubard 00:18:33

5. Operating Insights

The "Delete 75% of the First Paper" Rule for Evaluating Research

When evaluating AI research claims — whether for investment, hiring, or product decisions — the first paper in a new paradigm typically contains one real insight buried in complexity. The signal is in what the follow-on paper removes. Operator takeaway: don't bet on the full architecture of Paper 1; wait for the ablation that isolates what actually matters.

"A lot of machine learning, the follow-on paper is basically delete 75% of the first paper as we've often done in videos here. And keep the magic basically." — Ankit Gupta 00:21:46

Train on Hard, Incompressible Problems to Force Genuine Reasoning Capability

When building or evaluating AI systems for reasoning tasks, the benchmark problems matter enormously. Sudoku, ArcPrize, and mazes are cited specifically because the episode treats them as incompressible: solving them takes a minimum number of sequential steps that cannot be shortcut. Strong performance on these is better evidence of genuine reasoning ability than strong performance on tasks solvable by pattern matching. If you're building AI-powered products, test on incompressible problems.

"In HRM and TRM, they use Sudoku as an incompressible problem. Similarly, mazes. Those are incompressible problems. Rolling sum, incompressible problem." — Francois Chaubard 00:05:29

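One way to picture why a maze resists a fixed step budget is to count how many rounds a breadth-first wavefront needs to reach the goal. This is an illustrative sketch only. It assumes information spreads one cell per round, which is a simplification introduced here and not a claim from the episode, and the helpers `serpentine` and `wavefront_rounds` are hypothetical names.

```python
def serpentine(width, corridors):
    """Build a winding maze: open corridors joined by a single gap at alternating ends."""
    rows = []
    for i in range(corridors):
        rows.append("." * width)
        if i < corridors - 1:
            if i % 2 == 0:
                rows.append("#" * (width - 1) + ".")  # gap on the right
            else:
                rows.append("." + "#" * (width - 1))  # gap on the left
    goal_col = width - 1 if corridors % 2 == 1 else 0
    return rows, (0, 0), (len(rows) - 1, goal_col)


def wavefront_rounds(grid, start, goal):
    """Rounds of one-cell expansion needed to reach goal, or None if unreachable."""
    rows, cols = len(grid), len(grid[0])
    frontier, seen, rounds = {start}, {start}, 0
    while frontier:
        if goal in frontier:
            return rounds
        nxt = set()
        for r, c in frontier:
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                nr, nc = r + dr, c + dc
                inside = 0 <= nr < rows and 0 <= nc < cols
                if inside and grid[nr][nc] == "." and (nr, nc) not in seen:
                    seen.add((nr, nc))
                    nxt.add((nr, nc))
        frontier = nxt
        rounds += 1
    return None


for corridors in (3, 5, 9):
    grid, start, goal = serpentine(5, corridors)
    print(corridors, wavefront_rounds(grid, start, goal))
```

Under that one-cell-per-round assumption, the rounds required grow with the maze, so any fixed number of rounds is eventually exceeded by a large enough instance.
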
6. Overlooked Insights

The Latent Space Reasoning Combination Is the Actual Next Frontier — and Nobody Is Publicly Building It Yet

This was mentioned almost as an afterthought at the very end, but it is arguably the most important investment/research thesis in the episode. The insight is: LLMs are extraordinary at building rich semantic embedding spaces. TRMs/HRMs are extraordinary at reasoning inside a latent space recursively. Neither alone is sufficient. But no one has yet combined a large pretrained embedding model with a small recursive reasoning module operating in that latent space — and both speakers suggest this is where the real breakthrough will come. This is a specific, actionable, non-obvious architectural thesis that has not yet been productized.

"What you can imagine is we found a mapping from token space, from vision, from pixels, to some really cool latent space where things are just nicely semantically separated. But now in that space, use tiny reasoning models, use some type of recursion inside that, and train a little small model on that reasoning space. I think that's really going to work." — Francois Chaubard 00:37:40

Backprop Through Time Remains the Unresolved Bottleneck for the Combined Architecture

Mentioned briefly and technically, but critically important: even the TRM paper's improved backprop approach is still memory-constrained by backprop through time. This means the full potential of combining large models with deep recursion has not been unlocked — because nobody has solved how to train it efficiently at scale yet. This is a white-space research and infrastructure opportunity.
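
The memory constraint can be made concrete with rough arithmetic. This is a back-of-the-envelope sketch, assuming one hidden-state tensor must be kept per recursion step for the backward pass and ignoring every other intermediate. The batch size, sequence length, hidden width, and step count are hypothetical values chosen for illustration, not figures from the episode or the papers.

```python
def activation_bytes(steps_kept, batch, seq_len, hidden, bytes_per_value=4):
    """Rough activation memory if one hidden-state tensor is kept per recursion step."""
    return steps_kept * batch * seq_len * hidden * bytes_per_value


# Hypothetical sizes chosen only for illustration.
batch, seq_len, hidden = 32, 81, 512
total_steps = 96

# Full backprop through time keeps every step; truncating keeps only the last one.
full_bptt = activation_bytes(total_steps, batch, seq_len, hidden)
last_step_only = activation_bytes(1, batch, seq_len, hidden)

print(f"full BPTT:      {full_bptt / 2**20:.1f} MiB")
print(f"last step only: {last_step_only / 2**20:.1f} MiB")
print(f"ratio:          {full_bptt // last_step_only}x")
```

In this simplified model the cost of full backprop through time scales linearly with the number of recursion steps, which is why deeper recursion on a larger model runs into memory limits first.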

"Where I'm really excited is what happens if you do both. And you're still limited by backprop through time. Even Alexia is limited by that last step from a memory perspective. And so if you can make the model really big and you have lots of recursion and we do something else other than backprop through time, then we can get all the benefits of this and all the benefits of the giant LLMs. And then you can get some crazy stuff." — Francois Chaubard 00:34:19