Teahose.
// EPISODE
DWARKESH

The next big breakthrough will be AIs learning on the job

DATE June 26, 2026
SOURCE DWARKESH
PARTICIPANTS DARIO AMODEI, DWARKESH PATEL
// KEY TAKEAWAYS (6 ITEMS)
  1. RLVR Is the Current Paradigm Bet
  2. Verifiability Alone Is Not Enough
  3. The Inference Compute Waste Problem
  4. Sample Efficiency and Continual Learning Are the Same Problem
  5. On-Policy Self-Distillation (OPSD) as a Superior Alternative to Naive RL or SFT
  6. "Dreaming" as a Speculative Fourth Scaling Axis

1. Key Themes

RLVR Is the Current Paradigm Bet — But Its Generalization Limits Are Underappreciated

The labs are collectively wagering that training on millions of verifiable, containerized tasks will produce general intelligence. But the host raises a pointed question about whether short-horizon RL training actually generalizes to long-horizon real-world performance.

"The labs are betting that RLVR will generalize. That is that if you train on enough containerized, reproducible environments, you will develop a very general agent that can make and execute plans and learn rapidly from new information, and even pick up new skills, all within a single session." 00:05:56

"Maybe I'm reading too much into this, but it seems like he's saying that short horizon RL training doesn't necessarily generalize to long horizon RL performance." 00:07:16

Verifiability Alone Is Not Enough — Domains Must Also Be "Grindable"

One of the most structurally important insights in the episode: the reason computer use lags coding and math is not just data quality, but the inability to run thousands of parallel, deterministic rollouts from identical starting states.

"It is not enough for a domain to be verifiable. It also has to be very grindable in the sense that you have to be able to run lots of parallel rollouts against a deterministic and replayable simulator. And you had to run those rollouts from the same starting point." 00:03:08

"You can't just have a thousand agents go try the same checkout flow on Amazon to get better at using websites because Andy Jassy will find your bots and shut your ass down." 00:03:38

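The "grindable" requirement can be made concrete with a toy sketch. Everything named here (the `ToyCheckoutEnv` class, its three actions, the random policy) is a hypothetical stand-in and not something described in the episode. The only point is that a deterministic, replayable simulator lets every rollout start from an identical snapshot and be reproduced exactly. The rollouts run sequentially here, but they are independent and could be run in parallel.

```python
import copy
import random
from dataclasses import dataclass, field


@dataclass
class ToyCheckoutEnv:
    """Hypothetical deterministic simulator of a three-step checkout flow."""
    step_index: int = 0
    history: list = field(default_factory=list)

    # Correct action at each step (illustrative; a real app is far richer).
    TARGET = ("add_to_cart", "enter_address", "pay")

    def step(self, action: str) -> bool:
        """Apply an action and return True once the episode is over."""
        self.history.append(action)
        if action == self.TARGET[self.step_index]:
            self.step_index += 1
        return self.step_index == len(self.TARGET) or len(self.history) >= 6

    def reward(self) -> float:
        """Verifiable outcome: 1.0 only if the whole flow was completed."""
        return 1.0 if self.step_index == len(self.TARGET) else 0.0


def rollout(snapshot: ToyCheckoutEnv, seed: int) -> float:
    env = copy.deepcopy(snapshot)  # every rollout starts from the same state
    rng = random.Random(seed)      # a per-rollout seed makes the run replayable
    done = False
    while not done:
        done = env.step(rng.choice(ToyCheckoutEnv.TARGET))
    return env.reward()


snapshot = ToyCheckoutEnv()
rewards = [rollout(snapshot, seed) for seed in range(1000)]
replayed = [rollout(snapshot, seed) for seed in range(1000)]
print(f"success rate: {sum(rewards) / len(rewards):.2f}")
```

A live website offers neither property: the starting state drifts between attempts, and a thousand identical attempts are likely to be blocked.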
The Inference Compute Waste Problem

Roughly 30–50% of a lab's compute goes to inference — and currently none of that compute is feeding back into model improvement. This is framed as a systemic inefficiency that compounds over time.

"Around 30 to 50% of a lab's compute goes to inference. And that compute is currently not playing any productive role in helping improve the model. This seems like a huge waste. And it's even worse than it sounds, because it is only in deployment that the most valuable bits of information, which your model could learn from, are actually revealed." 00:07:44

Sample Efficiency and Continual Learning Are the Same Problem

The episode makes the non-obvious point that these two research problems, typically discussed separately, are deeply coupled. In-context learning is sample-efficient, but its memory cost keeps growing with the length of the context; weight updates are memory-efficient, but they need far more samples.

"Sample efficiency and continual learning are actually deeply connected problems. Relatively little data is available to the model on the job. Now, to learn from this data requires sample efficiency, and models can do that in context, but using the fast weights that are built on the fly by attention, which allows for the sample efficiency, but scales very poorly in terms of memory." 00:11:34

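A rough back-of-the-envelope calculation shows why in-context "fast weights" scale poorly in memory. The sketch below uses the standard size formula for an uncompressed key/value cache (two tensors per layer, one vector per KV head per token). The model shape is made up for illustration and does not describe any particular lab's model.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_value=2):
    """Bytes needed to cache keys and values for `tokens` positions."""
    # Factor of 2: one key vector and one value vector per head, per layer.
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


# Hypothetical model shape (assumption, not from the episode).
LAYERS, KV_HEADS, HEAD_DIM = 80, 8, 128

per_token = kv_cache_bytes(1, LAYERS, KV_HEADS, HEAD_DIM)
print(f"{per_token:,} bytes per cached token")

for tokens in (8_000, 1_000_000, 100_000_000):
    gib = kv_cache_bytes(tokens, LAYERS, KV_HEADS, HEAD_DIM) / 2**30
    print(f"{tokens:>11,} tokens -> {gib:,.1f} GiB")
```

The cache grows linearly with every token the model has seen, whereas the weights stay a fixed size no matter how much was learned into them.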
On-Policy Self-Distillation (OPSD) as a Superior Alternative to Naive RL or SFT

OPSD is presented as a technically superior method for continual learning because it avoids needing an outer-loop verifiable reward, provides denser per-token supervision signal, and surgically updates only the parameters necessary.

"We encourage the base model to make the same predictions when trying to solve some real-world problem as the model with all the contexts accumulated after a long session would have made. The whole point of this procedure is to distill what the model learned in a session back into the weights themselves." 00:13:00

"OPSD provides a much denser supervision signal than naive RL. Instead of projecting a single reward through the whole trajectory, you can train on the per-token probability discrepancy between the teacher and student." 00:13:28

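The "per-token probability discrepancy" can be sketched in a few lines. This is a minimal illustration under stated assumptions: the episode does not say which discrepancy measure is used, so KL divergence from student to teacher is chosen here as one plausible option, and the toy distributions are invented. The teacher is the model conditioned on the accumulated session context; the student is the same model without it, scored on a trajectory the student itself generated.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def per_token_losses(student_dists, teacher_dists):
    """One loss value per position of a student-sampled trajectory."""
    return [kl_divergence(s, t) for s, t in zip(student_dists, teacher_dists)]


# Toy next-token distributions over a 3-token vocabulary at 3 positions.
# All numbers are hypothetical.
student = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3], [0.1, 0.1, 0.8]]
teacher = [[0.6, 0.3, 0.1], [0.7, 0.2, 0.1], [0.1, 0.1, 0.8]]

losses = per_token_losses(student, teacher)
for position, loss in enumerate(losses):
    print(f"position {position}: loss {loss:.4f}")
```

Only the middle position, where the context changed the teacher's prediction, produces a nonzero loss. That is the sense in which the signal is both denser than a single trajectory-level reward and targeted at the tokens where the session actually taught something.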
"Dreaming" as a Speculative Fourth Scaling Axis

Beyond pre-training, RL, and inference-time compute, the host proposes a fourth axis: models spending compute to build their own RL environments and train against them — essentially a world-model-based rehearsal mechanism analogous to EfficientZero.

"If the AI can build a good simulation of reality against which to rehearse new skills or try alternative strategies and reinforce what actually works, then AIs could experience orders of magnitude more simulated samples in the same wall clock time." 00:15:25

"This would become a fourth axis of scaling alongside pre-training, RL, and inference time compute. You could call it test time training or dreaming." 00:16:25

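The rehearsal idea can be illustrated with a tiny tabular example. This is a Dyna-style sketch, not EfficientZero itself and not a method described in the episode: the five-state chain environment, the hyperparameters, and all names are assumptions. The agent takes a handful of real steps to record the dynamics, then performs thousands of cheap updates against its own recorded model.

```python
import random

GOAL = 4            # states are 0..4; reaching state 4 ends the episode
ACTIONS = (-1, +1)  # move left or right along the chain


def real_step(state, action):
    """Ground-truth dynamics; each call counts as one costly real sample."""
    nxt = min(max(state + action, 0), GOAL)
    return nxt, (1.0 if nxt == GOAL else 0.0)


# 1) A small amount of real experience: try each (state, action) pair once.
model = {}
for s in range(GOAL):
    for a in ACTIONS:
        model[(s, a)] = real_step(s, a)
real_samples = len(model)

# 2) "Dream": many Q-learning updates against the learned model only.
rng = random.Random(0)
pairs = list(model)
q = {pair: 0.0 for pair in pairs}
GAMMA, LR, DREAM_STEPS = 0.9, 0.5, 5000
for _ in range(DREAM_STEPS):
    s, a = rng.choice(pairs)
    nxt, r = model[(s, a)]
    future = 0.0 if nxt == GOAL else max(q[(nxt, b)] for b in ACTIONS)
    q[(s, a)] += LR * (r + GAMMA * future - q[(s, a)])

policy = {s: max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(GOAL)}
print(f"{real_samples} real samples, {DREAM_STEPS} simulated updates")
print("greedy policy:", policy)
```

The ratio of simulated updates to real samples is the quantity the "dreaming" axis would scale. The hard part, which this toy avoids entirely, is learning a model of reality accurate enough to rehearse against.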
The 2027–2028 Continual Learning Scenario

A specific near-term trajectory is sketched where RLVR produces a competent-enough agent to deploy, expanded context enables week-long co-working sessions, and end-of-session feedback distills learnings back into the base model — progressively expanding AI capability into non-verifiable domains.

"At the end of a week, you give it a thumbs up or a thumbs down. You give it a work review. And if you give it a thumbs up, the base model distills everything that the AI learned during the session... the gamut of AI skills and knowledge and capabilities can expand far beyond the verifiable domains that the model was originally trained against before it was deployed." 00:17:53


2. Contrarian Perspectives

Continual Learning Into Weights May Matter More Than Infinite Context

The prevailing optimistic framing is that sufficiently long context windows make weight-updating continual learning unnecessary. The host rejects this on two grounds: an ever-growing KV cache does not scale, and it is not how humans learn, since human learning involves compression.

"AIs can't just keep building up a bigger and bigger KV cache as they learn from more and more users. That's just not scalable. And that's also not how humans do it... When we learn stuff, there's clearly some kind of compression. And this aids our generalization and grokking." 00:08:41

RL Learning Less Per Sample Is Actually a Feature, Not a Bug

Most researchers treat RL's sample inefficiency as a liability. The host flips this, arguing that for continual learning specifically, the surgical nature of RL weight updates is precisely what you want — it prevents catastrophic forgetting.

"I wrote a post a few months earlier arguing that RL learns much less information per sample than supervised learning. But this may be a good thing rather than a bad thing. You only change the model as much as it is absolutely necessary to achieve the outcome and no more." 00:14:26

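One common way to make "less information per sample" concrete is to compare upper bounds in bits. This is a rough sketch and not necessarily the argument of the host's earlier post: a pass/fail reward can carry at most the binary entropy of the pass rate per episode, while a supervised target can carry up to log2 of the vocabulary size per token. The vocabulary size and episode length are hypothetical.

```python
import math


def binary_entropy_bits(p):
    """Expected information (bits) in a pass/fail outcome with pass rate p."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


VOCAB_SIZE = 50_000      # hypothetical
EPISODE_TOKENS = 1_000   # hypothetical

# Best case for outcome-reward RL: a 50% pass rate gives one bit per episode.
rl_bits_per_episode = binary_entropy_bits(0.5)

# Loose upper bound for supervised learning: every token is a labeled target.
sft_bits_upper_bound = EPISODE_TOKENS * math.log2(VOCAB_SIZE)

print(f"RL, pass rate 0.50: {rl_bits_per_episode:.3f} bits per episode")
print(f"RL, pass rate 0.01: {binary_entropy_bits(0.01):.3f} bits per episode")
print(f"SFT upper bound:    {sft_bits_upper_bound:,.0f} bits per episode")
```

Read in the host's framing, the small number is the appealing one for continual learning: a signal that carries few bits can only move the model a little.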
The Most Valuable Training Data Is Being Thrown Away Right Now

The current deployment model treats inference as terminal — no learning flows back. The host argues this is a fundamental architectural mistake, not just an optimization gap, because deployment is the only place where the rarest, most valuable signal exists.

"It is only in deployment that the most valuable bits of information, which your model could learn from, are actually revealed. Things like, what's actually happening in the organizations where I'm being used? And what are they using me for? And what kinds of mistakes do I tend to make in the real world?" 00:07:44

The Real Ceiling on AI Progress Is Not Algorithmic or Compute — It's Environmental Replayability

Most scaling discourse focuses on compute and data quantity. The host argues the binding constraint for many domains is whether you can construct a replayable simulator at all — and for the most important human skills (politics, business-building, market trading), you simply cannot.

"How do we train an AI to get really good at building a business from scratch? How about winning court cases or having a profitable day of trading in the markets or helping a candidate win an election? The rollout here requires interacting with the real world and you can't recreate it from just within the data center." 00:04:31


3. Companies Identified

Cursor

AI-powered coding environment with a tab completion model that does live online learning. Mentioned as a rare working example of production online learning: the Cursor Tab model learns which edits users actually accept, across more than 400 million requests per day.

"The cursor tab model online learns by predicting the same exact objective for over 400 million requests a day. The objective here being which edits actually got accepted by the user." 00:09:38

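The general shape of online learning from accept/reject feedback can be sketched with a tiny logistic model updated one event at a time. This is not Cursor's actual system, whose details the episode does not give. The `OnlineAcceptModel` class, the two features, and the event stream are all invented for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class OnlineAcceptModel:
    """Hypothetical logistic model predicting whether an edit is accepted."""

    def __init__(self, n_features, lr=0.1):
        self.w = [0.0] * n_features
        self.lr = lr

    def predict(self, x):
        return sigmoid(sum(wi * xi for wi, xi in zip(self.w, x)))

    def update(self, x, accepted):
        """One SGD step on log loss, applied as soon as feedback arrives."""
        error = self.predict(x) - (1.0 if accepted else 0.0)
        self.w = [wi - self.lr * error * xi for wi, xi in zip(self.w, x)]


model = OnlineAcceptModel(n_features=2)

# Made-up stream: feature 0 is a bias term, feature 1 flags a short edit.
# In this toy data short edits are accepted and long ones are rejected.
stream = [([1.0, 1.0], True), ([1.0, 0.0], False)] * 200
for features, accepted in stream:
    model.update(features, accepted)

print(f"P(accept | short edit) = {model.predict([1.0, 1.0]):.2f}")
print(f"P(accept | long edit)  = {model.predict([1.0, 0.0]):.2f}")
```

The key property is that the objective never changes ("was the edit accepted?"), so every request doubles as a labeled training example with no separate annotation step.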
Mercury

Business banking and fintech platform. Mentioned as a sponsor and as a tool the host uses; highlighted for automatically scanning invoices sent to a dedicated email address, extracting their data, and creating draft payments.

"Mercury automatically downloads it, scans it, and extracts all the relevant information. Things like the contractor name, address, payment amount, invoice number, and due date, and then uses all of this to create a draft payment." 00:11:06


4. People Identified

Dario Amodei

CEO and co-founder of Anthropic, maker of the Claude models. Quoted directly on the distinction between training context length and serving context length — used as evidence that short-horizon RL training may not trivially generalize to long-horizon performance.

"There's two things. There's the context length you train at, and there's a context length that you serve at. If you train at a small context length and then try to serve at a long context length, like maybe you get these degradations." 00:07:05

Sasha Rush

ML researcher, professor at Cornell Tech, known for work on NLP and efficient transformers. Mentioned as a collaborator on an informal lecture about on-policy self-distillation, lending credibility to the technique as an active area of serious research.

"I recorded a little impromptu blackboard lecture on my iPhone with Sasha Rush a couple of weeks ago, and it's in the link in the description." 00:13:00


5. Operating Insights

Use a Dedicated Invoice Email Address to Eliminate Billing Overhead

A tactically useful operational tip for small and growing organizations: routing all contractor invoices to a single Mercury-connected email address automates extraction, draft creation, and review — eliminating manual inbox archaeology entirely.

"I just give everybody an email address that goes straight to Mercury... Mercury automatically downloads it, scans it, and extracts all the relevant information... and then uses all of this to create a draft payment. Mercury then stores a list of these drafts for me to review." 00:11:06

The "Thumbs Up / Work Review" Feedback Loop as an AI Management Protocol

For teams deploying AI agents on multi-day tasks, the host sketches a concrete management cadence: let the AI work for a full week, then deliver a structured end-of-session review. This framing suggests operators should be designing feedback rituals into AI workflows now, not just task prompts.

"At the end of a week, you give it a thumbs up or a thumbs down. You give it a work review. And if you give it a thumbs up, the base model distills everything that the AI learned during the session." 00:17:53


6. Overlooked Insights

Getting AIs to Rebuild Real Applications from Scratch Is Both a Computer Use Solution and a Coding RL Objective Simultaneously

This was dropped in a single sentence, but it is a significant strategic insight for anyone building AI training infrastructure. Having AIs clone Slack, Gmail, and other web applications produces replayable computer-use simulators, and the cloning work is itself a high-quality RL training objective for coding. One investment in environment-building pays off across two of the hardest frontier problems at once.

"Once AIs get good enough at coding themselves to build these clones with extremely high fidelity, then I'm sure the computer use will make quicker progress than it is right now. And you're also killing two birds with one stone with this kind of procedure because getting AIs to rebuild whole applications from scratch is also a great RL objective for coding." 00:03:38

EfficientZero's Architecture Is the Hidden Predecessor to "Dreaming" — and Nobody Is Talking About It in the LLM Context

The EfficientZero result is presented briefly as historical context: a model that plays dozens of simulated games in its own head for each real step would probably beat a novice human when both get only two hours against an unseen Atari game. The implication is large. There is already a working proof of concept that world-model-based internal rehearsal can substitute for real-world sample volume. This precedent is rarely cited in mainstream LLM scaling discourse, yet it maps directly onto the proposed fourth scaling axis.

"If this model and a human both got two hours to play against a simulator of an Atari game that they hadn't seen before, this model would actually probably beat the novice human... for each step in the real game, EfficientZero is playing dozens of simulated games in its head. In a similar way, future LLMs might be able to consume far less real-world data while practicing endlessly against environments that they build for themselves." 00:15:55