
Memory and Continual Learning: Engram's Dan Biderman and Jessy Lin

DATE: June 24, 2026
SOURCE: TRAINING DATA
PARTICIPANTS: DAN BIDERMAN, JESSY LIN, SHAUN MAGUIRE, SONYA HUANG
// KEY TAKEAWAYS (6 ITEMS)
  1. The Bottleneck Has Shifted from Intelligence to Contextual Learning
  2. Continual Learning via Training Is Categorically Different from RAG or Context Engineering
  3. 100x Token Reduction is a Near-Term, Concrete Value Proposition
  4. The KV Cache is an Enormous Unsolved Inefficiency
  5. The Research-Product Loop Must Be Rebuilt for Always-Training Models
  6. Personalized Models Will Be Stratified: Everyone Gets Their Own, and They Will Diverge

1. Key Themes

The Bottleneck Has Shifted from Intelligence to Contextual Learning

The core thesis of Engram is that raw model intelligence is no longer the limiting factor for AI utility — it's the ability to learn and deeply internalize new, evolving, private context. The frontier labs are optimizing for a different problem.

"The bottleneck for making these models more useful these days is not really raw intelligence, but understanding new and evolving context... How do you bake that into the model weights the same way that pre-training and post-training bakes that into the model weights very deeply?" — Dan Biderman 00:01:09

Continual Learning via Training Is Categorically Different from RAG or Context Engineering

The team argues that retrieval and context stuffing are fundamentally limited — they are externalized memory that cannot form the kind of abstract associations that internalized weights can. The real unlock is applying frontier-lab-style training pipelines to every private domain.

"An under-leveraged tool these days is using the same kind of training pipeline or framework or kind of workflow that the frontier labs are using to make these models really good at frontier math or code. But applying that to every kind of domain, every kind of context that you have, like let's say in a company." — Dan Biderman 00:02:22

"If you are always doing RAG, you can't make associations like, oh, you know, I see somebody on the team is doing this kind of research. And I kind of recall at an abstract level, oh, there's this related thing that you might want to know about. You didn't even ask about it, right? But these kinds of associations can only happen in weights." — Dan Biderman 00:29:22

100x Token Reduction is a Near-Term, Concrete Value Proposition

The efficiency case for trained-in knowledge versus retrieval isn't marginal — it's potentially orders of magnitude. This is not just a research thesis; it's an immediate enterprise pain point.

"You don't have to research things and reread things and you don't have to write monstrous system prompts... that can give you two orders of magnitude reduction in token inference consumption. It's not like 50%; it can be 100x fewer tokens, because many things, especially things that relate to people and teams and organization and priorities, are things that you can't really find in one document." — Dan Biderman 00:07:10
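To make the "two orders of magnitude" claim concrete, here is a minimal arithmetic sketch. It assumes a request consists of re-read context (system prompt plus retrieved documents) and a small core of question and answer tokens. The function names and the 500-token core are hypothetical illustrations, not figures from the episode.

```python
def token_reduction(context_tokens: int, core_tokens: int) -> float:
    """Ratio of tokens per request with re-read context vs. the core alone."""
    return (context_tokens + core_tokens) / core_tokens


def context_needed(target_ratio: float, core_tokens: int) -> float:
    """Context size at which dropping the context yields target_ratio."""
    return (target_ratio - 1) * core_tokens


# Hypothetical request: 500 tokens of question + answer.
core = 500
needed = context_needed(100, core)           # context required for a 100x ratio
ratio = token_reduction(int(needed), core)   # sanity check of the inverse

print(f"context needed for 100x: {needed:.0f} tokens")
print(f"resulting ratio: {ratio:.1f}x")
```

Under these assumptions, a 100x reduction only appears when the re-read context is about 99 times larger than the question and answer themselves, which is the regime of long system prompts and repeated document reads that the quote describes.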

The KV Cache is an Enormous Unsolved Inefficiency — and the Compression Opportunity is Massive

By Biderman's estimate, the KV cache for a single Wikipedia article can occupy about 80 GB of GPU high-bandwidth memory, while the entire weights of a 70B Llama model are about 100 GB and encode, with some distortion, the internet. This asymmetry is a core motivation for weight-based memory compression.

"A KV cache for a single Wikipedia article for some Taylor Swift or something like this, it will be like 80 gigabytes of HBM memory on the GPU. And the entire weights of a 70B Llama model would be about 100 gigabytes. And with some distortion, they remember the entire internet... What if we can take those 80 gigabytes, spend some compute offline, then compress it and make it really, really small so that the thing we load in cache is 1,000x smaller?" — Dan Biderman 00:30:16
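The arithmetic behind this comparison can be sketched as follows. The episode gives only the 80 GB and 100 GB figures; the shape parameters below (80 layers, 128-dimensional heads, 64 attention heads or 8 grouped-query KV heads, 16-bit values) are assumptions based on the published Llama 70B configurations, and the function name is hypothetical.

```python
def kv_cache_bytes(num_tokens: int,
                   num_layers: int = 80,
                   num_kv_heads: int = 8,
                   head_dim: int = 128,
                   bytes_per_value: int = 2) -> int:
    """Bytes needed to cache keys and values for num_tokens tokens."""
    # Factor 2: one key vector and one value vector per layer per KV head.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * num_tokens


GB = 10**9

# Per-token cost with grouped-query attention (8 KV heads)
# and with full multi-head attention (64 KV heads).
per_token_gqa = kv_cache_bytes(1)
per_token_mha = kv_cache_bytes(1, num_kv_heads=64)

# How many tokens of context fit in the 80 GB cited in the quote.
tokens_in_80gb_gqa = 80 * GB // per_token_gqa
tokens_in_80gb_mha = 80 * GB // per_token_mha

# Weight memory for 70B parameters stored at 16 bits each.
weights_bytes_fp16 = 70 * 10**9 * 2

print(per_token_gqa, per_token_mha)
print(tokens_in_80gb_gqa, tokens_in_80gb_mha)
print(weights_bytes_fp16 / GB)
```

Under these assumptions the cache costs about 0.33 MB per token with grouped-query attention and about 2.6 MB per token without it, so 80 GB corresponds to roughly 244,000 or 30,500 tokens respectively. The quoted figure therefore depends heavily on the attention configuration and on how long the cached text is. The weights come to 140 GB at 16 bits, or 70 GB at 8 bits, which brackets the "about 100 gigabytes" in the quote.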

The Research-Product Loop Must Be Rebuilt for Always-Training Models

Current frontier lab structure — researchers train a model, throw it over the fence to product — breaks down when the model is continuously learning from user interactions. A new tightly integrated loop is required.

"In this world where the models are always training, the inputs that users provide are very intricately tied to what the models learn from, like what the training signal is. And so there needs to be a lot more of a kind of integrated loop between research and product." — Dan Biderman 00:18:13

Personalized Models Will Be Stratified: Everyone Gets Their Own, and They Will Diverge

The long-term vision is not one big model everyone uses, but hundreds of millions of personalized models — for individuals, teams, companies — that differ meaningfully from each other and from the frontier.

"I'm imagining a world where everyone has their own model that is really different from the other person's model and from the frontier model... Whether it's an individual or a team, I think there's an element of like having different kinds of intelligence everywhere." — Dan Biderman 00:42:52

The Problem of "What to Know" Is Unsolved — and That's the Core Research Frontier

Neither biology nor AI has a principled answer to what should be internalized versus externalized. Engram is betting on learning the signal with as few heuristics as possible, similar to how the brain manages noisy input.

"I think what people are worried about these days is — what are the right things to store? It's an unsolved problem. I don't think anyone has answered it. We're all working on it. It's also the fundamental question of biological memory... As humans, we watch TikTok and get exposed to a lot of garbage. And still the brain is able to learn and not completely go off the rails. And we think models should be the same as well." — Dan Biderman 00:27:06

Demis Hassabis Publicly Called Out Memory and Continual Learning as Requiring New Breakthroughs

This is not a fringe academic view — the CEO of Google DeepMind validated the thesis publicly.

"Demis at the Sequoia event about a month ago said pretty clearly that we need new breakthroughs around these topics. And obviously they're thinking about them. We're just focusing exclusively on this." — Dan Biderman 00:17:23


2. Contrarian Perspectives

The Bitter Lesson Applied to Memory Means You Should Burn More Compute — Not Engineer Smarter Retrieval

Most practitioners respond to memory challenges with smarter retrieval architectures. Engram's contrarian view is that the bitter lesson demands the opposite: throw more training compute at the problem.

"If you're really bitter-lesson-pilled, what you want to do is think, how can I burn more compute? And how can I burn it on new context that I have not seen before? We are not betting that the overall direction of AGI is going to end anywhere soon. We just think there's more compute to scale. And if I truly want to understand Shaun and Shaun's work and Shaun's context, just rereading files is not going to make it." — Dan Biderman 00:24:15

Separating Facts from Skills in Model Weights Is a False Dichotomy

A popular view is that models should stop memorizing facts and instead only learn reasoning. Engram argues this is both unnatural and wrong — you cannot think complex thoughts without internalized building blocks.

"If you need to recall basic facts in order to take the next step in your thinking, you can't get very far... In order to think more and more complex and deep thoughts about things, you kind of need to internalize something so that you can compose them into more abstract concepts." — Dan Biderman 00:10:36

Spending Time Fine-Tuning Your Own Model Beats Context Engineering — But Nobody Does It

The common behavior is context engineering — better prompts, smarter RAG. Engram's contrarian claim is that if you actually trained your own model, you'd get compounding returns over time — and the current paradigm structurally prevents individuals from accessing this.

"If you resign from your job today and your sole mission was to make a model that's better for you, and you would use OpenAI and Anthropic and all these frontier models, and you just 24-7 engineer the context... your way to move the needle is very limited as an individual. You'll just be better off waiting for the next version of the model... We would like to see a future where actually the more time you spend on the thing actually translates to the quality of performance." — Dan Biderman 00:32:32

Knowing What to Search for Is Itself the Hard Problem — RAG Assumes the Query

The implicit assumption in retrieval-augmented generation is that the model knows what it's looking for. Engram identifies this as the overlooked flaw: intuition about where to look must live in weights, not in the retrieval layer.

"The main limitation with retrieval systems in general is the problem is not so much what to store and where to put it. The problem is how to address it, like how to query the thing. Do you know what to look for even?... Knowing what to search is something that's intuitive and can happen in the weights." — Dan Biderman 00:29:32


3. Companies Identified

Engram

AI research lab focused on memory and continual learning for language models. Building per-team, per-individual fine-tuned adapters that allow models to continuously learn private organizational context, reducing inference token costs by up to 100x while improving task relevance. Partners include Notion, Microsoft, and Harvey.

"We're working with partners like Notion and Microsoft and Harvey that have these places where people are doing a lot of work over a long period of time... We're training per-team models within these workspaces that deeply understand those contexts and can improve with time on the things that people care about." — Dan Biderman 00:03:35

Notion

Collaborative workspace product. Named as an active Engram integration partner.

"We're working with partners like Notion and Microsoft and Harvey." — Dan Biderman 00:03:35

Microsoft

Enterprise software giant. Named as an active Engram integration partner.

"We're working with partners like Notion and Microsoft and Harvey." — Dan Biderman 00:03:35

Harvey

AI-native legal platform. Named as an active Engram integration partner.

"We're working with partners like Notion and Microsoft and Harvey." — Dan Biderman 00:03:35

Databricks

Data and AI platform. Cited as a comparable company to Engram's long-term vision — building the neural interface to the enterprise data plane.

"Sharing some similarities to great companies like Databricks and Oracle where we form these memories that happen to be neural memories with models that happen to be personalized." — Dan Biderman 00:43:48

Oracle

Enterprise database company. Cited alongside Databricks as a structural analogue to Engram's data plane vision.

"Sharing some similarities to great companies like Databricks and Oracle." — Dan Biderman 00:43:48

OpenAI

Frontier AI lab. Referenced as a provider whose models Engram works with, and whose ChatGPT memory product is noted as a flawed early reference point for consumer memory.

"If you resign from your job today... and you used OpenAI and Anthropic and all these frontier models, and you just 24-7 engineer the context right... your way to move the needle is very limited as an individual." — Dan Biderman 00:32:32

Anthropic

Frontier AI lab. Referenced as a provider whose models Engram works with; interpretability research team noted as doing important work disentangling facts from algorithms in weights.

"We need all these smart people and Anthropic interpretability to try and break them apart." — Dan Biderman 00:11:30

Fireworks AI

Inference platform. Mentioned in connection with Sonya Huang's work with the company; also noted as a potential partner for offline compute compression of KV caches.

"Sonya works with Fireworks. She really loves Zyan." — Jessy Lin 00:28:20

Mosaic (MosaicML)

AI training platform (acquired by Databricks). Referenced as a place where founders went to learn how LLM training works.

"Anyone else who's seen the ChatGPT moment and went to do some work at Mosaic and stuff like that to learn how the sausage is made on the NLP side." — Dan Biderman 00:37:50

GitHub Copilot

AI code completion tool by GitHub/Microsoft. Named as one of the two genuine inflection-point moments in AI product development alongside ChatGPT.

"To me, the main events were GitHub Copilot. That for me was just the main event and ChatGPT." — Dan Biderman 00:33:56


4. People Identified

Dan Biderman

Co-founder of Engram. Background in neuroscience, with deep interest in what makes pre-training generalize. Leads the core research and product architecture direction at Engram.

"What about pre-training or even post-training makes it possible for the models to generalize in these magical emergent ways and controlling that process so that a company has a set of private data — how do we make the models learn that just as well as the models know like the capital of France?" — Dan Biderman 00:00:00

Jessy Lin

Co-founder of Engram. Background in cognitive and computational science; started a PhD at Stanford in 2007, during the SaaS era. Brings a systems-level view on model architecture and the vision/language split in AI progress.

"I started a PhD in the SaaS era in 2007 at Stanford. And AI was boring as hell at the time... In 2012, AlexNet happened. Vision was dominating for six years or whatever. Are you guys surprised that the language approach seems to be dominating over vision in progress?" — Jessy Lin 00:36:08

Demis Hassabis

CEO of Google DeepMind. Cited as publicly endorsing the view that memory and continual learning require new research breakthroughs — validating Engram's thesis from the highest level of the field.

"Demis at the Sequoia event about a month ago said pretty clearly that we need new breakthroughs around these topics." — Dan Biderman 00:17:23

Amos Tversky

Israeli cognitive psychologist and behavioral economist, co-developer of prospect theory with Daniel Kahneman. Cited as an intellectual inspiration for studying natural cognition rather than artificial intelligence.

"As Amos Tversky, the Israeli psychologist used to say, he's not interested in artificial intelligence. He's interested in natural stupidity. So I would say I started similarly trying to see how people and animals experience the world." — Dan Biderman 00:20:18


5. Operating Insights

Train Your Way to Context; Don't Prompt Your Way to It

For enterprise AI deployments, the highest-leverage investment is building lightweight training infrastructure on proprietary data — not better prompts or larger context windows. The compounding returns from training accrue over time; context engineering has a hard ceiling.

"It's obvious for anyone who's trained models that there is a superior way to integrate across the ideas and capabilities. And it involves this kind of magic of training... We are clear that this has to happen in those high-stake domains of math and coding and cyber. We just think much of this magic can actually end up in the hands of many more people in interesting ways." — Dan Biderman 00:14:48

Identify the Repeated Queries Across Your Team — Those Are Your First Training Targets

The immediate, tractable ROI from continual learning is eliminating the same documents being re-read and the same queries being re-run by different employees. That's a direct, measurable inference cost reduction and a clear signal for what to train on first.

"Across people in the same company, they're running the same queries on the same documents over and over again. And that should be something the model just knows. Like in the same way you ask an employee, they don't type into the search box, 'what was I working on yesterday?' They just know." — Dan Biderman 00:28:28
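One way to surface those repeated lookups is to count how many distinct people issue the same query against the same document. The sketch below assumes a hypothetical access log of `(user, query, document)` records; the log format, function name, and sample data are illustrative and do not come from the episode or from any Engram product.

```python
from collections import defaultdict


def repeated_lookups(log, min_users=2):
    """Return ((query, doc), user_count) for pairs issued by >= min_users people."""
    users_by_pair = defaultdict(set)
    for user, query, doc in log:
        # Normalize whitespace and case so trivially different queries match.
        key = (query.strip().lower(), doc)
        users_by_pair[key].add(user)
    # Most widely shared lookups first; ties broken alphabetically.
    return sorted(
        ((pair, len(users)) for pair, users in users_by_pair.items()
         if len(users) >= min_users),
        key=lambda item: (-item[1], item[0]),
    )


# Hypothetical sample log.
sample_log = [
    ("ana", "Q3 roadmap priorities", "roadmap.md"),
    ("ben", "q3 roadmap priorities ", "roadmap.md"),
    ("cy", "Q3 roadmap priorities", "roadmap.md"),
    ("ana", "oncall rotation", "runbook.md"),
    ("ben", "oncall rotation", "runbook.md"),
    ("cy", "expense policy", "handbook.md"),
]

targets = repeated_lookups(sample_log)
for (query, doc), count in targets:
    print(f"{count} users asked {query!r} against {doc}")
```

Exact-match counting is the simplest possible signal. A real deployment would likely need fuzzier query matching and access controls, both of which are outside the scope of this sketch.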

Build the Feedback Loop Between Users and Training Into the Product From Day One

Product teams that treat the model as a static artifact and optimize purely at the prompt layer are structurally disadvantaged. The winning architecture integrates what users do into what the model learns — immediately.

"In this world where the models are always training, the inputs that users provide are very intricately tied to what the models learn from, like what the training signal is. And so there needs to be a lot more of a kind of integrated loop between research and product." — Dan Biderman 00:18:13


6. Overlooked Insights

The "Skills You Take With You" Problem Will Create a New Class of Personal AI Asset

This was mentioned almost in passing, but it has enormous economic and legal implications. If a model trained on your work at one company encodes your skills and ways of doing things, the question of what you can "carry" when you leave becomes a new kind of IP dispute — and potentially a new kind of personal asset class that incentivizes deeper AI adoption.

"I think a holy grail is like you go to work and you just burn through all these tokens and you create all this value. And somehow all the IP and stuff stays with the company. But somehow the skills you learned, the things you invented, your ways of doing things — some of them you can take with you as well to your next job in a way that's sanitized and not harmful to any other company's IP... I think doing it in the digital world would be pretty interesting and pretty rewarding because it will force each of us to push the frontier and implement AI more deeply in our companies and our individual life and then be rewarded for it." — Dan Biderman 00:35:23

The 3-6 Month Capability Gap Is a Recurring, Reliable Commercial Window

Buried in a sentence about inference costs is a strategically important observation: there is a consistent window of 3-6 months during which bespoke tasks are not yet well-served by the frontier model, but can be solved by a fine-tuned specialist model. This isn't a one-time arbitrage — it's a structural recurring gap as new task types continuously emerge, and it's a durable commercial moat for any company that can close that gap with fast training.

"We kind of think there's going to be consistently this gap of like three to six months ahead where there's certain things that are bespoke that people are just exploring. The models are not fully great for them. The models will at some point be great for them. But if you can autonomously learn in a very lightweight way, it will give value in that time in terms of capabilities." — Dan Biderman 00:07:35