Eric Jang – Building AlphaGo from scratch
- 01 Neural Networks as Simulation Compressors
- 02 The MCTS Policy Distillation Loop as the Core of Self-Improvement
- 03 High Variance is the Core Problem in RL, and Credit Assignment is the Solution
1. Key Themes
Neural Networks as Simulation Compressors
The most profound insight from AlphaGo is that a relatively shallow neural network (10 layers) can approximate what would otherwise be a computationally intractable search problem. Eric argues this has implications far beyond Go.
"A 10-layer neural network pass... 10 steps of reasoning... is able to amortize and approximate to a very high fidelity a nearly intractable search problem. This was a breakthrough that I think most people don't even understand today, fully comprehend how profound that accomplishment is." 01:17:20
"This is what also girds AlphaFold for example, where you have a difficult physical simulation process that you would need to roll out so many micro-scale simulations and yet 10 steps of somewhat small neural network can somehow capture what feels like an NP-class problem." 01:17:50
The MCTS Policy Distillation Loop as the Core of Self-Improvement
AlphaGo's training genius is a flywheel: search produces better moves than the raw policy, those better moves become training targets, and the policy internalizes the search — allowing each new round of search to start from a stronger baseline. This is a general principle applicable beyond Go.
"The beauty of how AlphaGo trains itself is that it actually can take this final search process, the outcome of the search process and tell the policy network, hey, like you know, instead of having MCTS do all this legwork to arrive here, why don't you just predict that from the get-go." 01:02:59
"It's almost like if you could just amortize the first 1,000 steps actually into the policy network instead of the search process, then you can begin at a much better starting point and then get a much better result for the number of sims that you play." 01:04:27
High Variance is the Core Problem in RL, and Credit Assignment is the Solution
Whether in AlphaGo or modern LLM RL, the fundamental bottleneck is variance in the learning signal. The difference between good and bad RL algorithms is essentially how well they solve credit assignment — attributing which specific actions caused a win.
"You only have one label out of this enormous data set of actions of supervision actions... the scale of your variance is actually very bad." 01:28:17
"You want to reduce variance by trying to make this smaller... ideally what you — they call this advantage — there are multiple ways to compute it." 01:36:22
2. Contrarian Perspectives
AlphaGo Is More Profound, Not Less, the More You Understand It
Dwarkesh pushes the intuition that AlphaGo seems less impressive once you understand the explicit engineering that went into the tree search scaffolding. Eric firmly pushes back — the more profound point is that simple neural architectures can compress almost-intractable problems.
"I personally disagree. I think they're profound for different reasons... by construction let's say a 10-layer neural network can only do 10 sequential steps of thinking right, 10 steps of neural network paralyzed distributed representation thinking is able to amortize and approximate to a very high fidelity a nearly intractable search problem." 01:16:50
NP-Hard Problems May Not Be as Hard as We Think
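Eric suggests that the success of AlphaGo, AlphaFold, and AlphaTensor hints that worst-case complexity results capture only part of what makes a problem hard in practice. He is explicit that this is not evidence about P versus NP: these systems find useful approximate solutions, not exact optima.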
"It actually makes me wonder if our understanding of problems like P equals NP or these very fundamental computational hardness problems are incomplete right. It's not like obviously this is not proof of P or anything, but there's something to it that is very disturbing where what felt like a very hard problem can fall to a very simple macroscopic simulation." 01:18:19
"NP problems have been formulated solutions to NP hard problems as worst case complexity and I wouldn't say this solves go it doesn't give us an exact solution of the optimum but in practice it is extremely useful and the same thing has been shown in AlphaTensor, AlphaFold." 01:19:14
ResNets Still Outperform Transformers in Low-Data, Spatially-Structured Regimes
Counter to the prevailing assumption that transformers are universally superior, Eric finds that for Go specifically (and likely similar board-game-like spatial problems), ResNets offer better bang-for-the-buck at lower compute budgets.
"For small data regimes, my experience is that ResNets still kind of outperform transformers and kind of give you more bang for the buck at lower budgets... They provide the inductive bias of, like, local convolutions. And generally, transformers start to outperform residual convolutional networks when you want more global context." 01:33:05
MCTS Is Not Guaranteed to Improve on the Policy — and Most Practitioners Don't Know When It Fails
Eric explains a non-obvious failure mode: if the value function is inaccurate (e.g., because the replay buffer lacks late-game states), MCTS can actually degrade policy quality.
"If your terminal values of the leaves are not good then this will actually propagate all the way up and cause your puck selection criteria and your backups to be off and then you end up visiting a very, very different distribution than what your policy initially recommended." 01:09:08
"That's why it's not a guarantee to improve and that is why I suspect why AlphaGo Lee had the playouts to the end in their training algorithm so they could ground this thing in real playouts." 01:10:08
3. Companies Identified
KataGo
- Description: Open-source Go AI project
- Why Mentioned: Achieved a 40x reduction in compute required to train a state-of-the-art Go AI, relative to earlier systems. Now the standard AI that serious Go practitioners train against. Also introduced architectural innovations like global feature pooling that improved training efficiency.
- Quote: "In 2020, there was an open source project called KataGo by David Wu from Jane Street, who basically achieved a 40x reduction in compute needed to train a really strong Go bot tabula rasa... This is what most Go practitioners today train against when they're playing an AI." 01:31:59
Jane Street
- Description: Quantitative trading firm
- Why Mentioned: David Wu, creator of KataGo, was from Jane Street. Also featured as podcast sponsor with a detailed data center tour, showing their infrastructure evolution from six Dell boxes to liquid-cooled GB300 cabinets consuming about 140 kW per rack at peak.
- Quote: "These cabinets, these GB300 cabinets consume at peak about 140 kW. Compare that to traditional air cooled, you're talking about 10 to 40 kW." 01:00:08
4. People Identified
David Wu
- Description: Researcher at Jane Street, creator of KataGo
- Why Mentioned: Achieved a 40x efficiency improvement in training strong Go AIs. His open-source work democratized Go AI to the point where what cost millions at DeepMind can now be replicated for thousands of dollars.
- Quote: "There was an open source project called KataGo by David Wu from Jane Street, who basically achieved a 40x reduction in compute needed to train a really strong Go bot tabula rasa." 01:31:59
John Schulman
- Description: AI researcher, creator of PPO and generalized advantage estimation (GAE)
- Why Mentioned: Eric specifically recommends his generalized advantage estimation paper as the authoritative treatment on reducing variance in RL gradient estimation.
- Quote: "I highly recommend John Schulman's generalized advantage estimation paper as like a good treatment on how to think about various ways to compute it." 01:36:51
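For readers who want to see the estimator the paper is known for, here is a minimal sketch of the standard recursion: the TD residual is r_t + gamma * V(s_{t+1}) - V(s_t), and the advantage accumulates those residuals with decay gamma * lambda. The list layout (one more value than rewards, for bootstrapping) and the default hyperparameters are assumptions made for this illustration.

```python
def generalized_advantage_estimates(rewards, values, gamma=0.99, lam=0.95):
    """Compute advantages for one trajectory.

    rewards: list of length T.
    values:  list of length T + 1; the last entry is the value of the final state
             (0.0 if the episode terminated).
    """
    if len(values) != len(rewards) + 1:
        raise ValueError("values must have exactly one more entry than rewards")
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        td_residual = rewards[t] + gamma * values[t + 1] - values[t]
        running = td_residual + gamma * lam * running
        advantages[t] = running
    return advantages

# A game with a single terminal reward, as in Go.
print(generalized_advantage_estimates([0.0, 0.0, 1.0], [0.5, 0.5, 0.5, 0.0], gamma=1.0, lam=0.0))
```

Setting `lam` to 0 gives one-step TD residuals (low variance, more bias from the value function), while `lam` of 1 recovers the return minus the value estimate (less bias, more variance).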
Andrej Karpathy
- Description: AI researcher and educator
- Why Mentioned: Eric references Karpathy-style auto-research hyperparameter tuning as a practical approach to architecture search. Also Dwarkesh references Karpathy's framing of RL as "sucking supervision through a straw."
- Quote: "You can also use kind of a Karpathy-style auto-research hyperparameter tuning to make your architecture pretty good." 01:37:53
5. Operating Insights
Always Initialize Research Projects Close to the Answer Before Attempting Tabula Rasa
Eric offers a strongly held research philosophy: start experiments from the best available initialization (e.g., supervised on expert data) rather than trying to solve everything end-to-end from scratch. This is directly applicable to any ML team or research organization.
"You generally want to kind of initialize, just as in deep learning, initialization is everything, right? You always want to initialize your research project to something as close to success as possible, especially if you're doing something new that you haven't done before. Like, always pick something that works and then get it to do something better rather than start from something that doesn't work at all and then try to make it work." 01:39:47
Ground Your Value Functions in Real Outcomes Before Trusting Search
A concrete operational lesson: before investing compute in MCTS or search-based improvement, ensure your value function is well-calibrated on terminal states. A fast practical hack: prevent agents from resigning in 10% of games to generate late-game training data.
"In practice what you could also do is just like for 10% of the games you prevent the bots from resigning and you just say like resolve it to the end so you get some training data in your replay buffer to really resolve those kind of late stage playouts that normal human players would kind of not play to." 01:10:08
Transfer Learning Across Problem Sizes Is Underutilized
KataGo's insight that training on smaller board sizes (9x9) transfers meaningfully to larger ones (19x19) suggests a general principle: when training costs are high, consider whether a smaller-scale version of the problem can bootstrap learning efficiently.
"If you train a model that can train on both 9x9 and 19x9 data and KataGo proposed one of these architectures then there's some pretty good transfer learning from the value head evaluated at 9x9 to the 19x19." 01:13:54
6. Overlooked Insights
The Democratization of Frontier AI Research Is Already Here — and Accelerating
Eric mentions almost in passing that what required a full DeepMind research team and millions in compute can now be replicated for a few thousand dollars using LLM-assisted coding. This is not just a curiosity — it signals that the barrier to frontier-adjacent AI research has collapsed, with major implications for who will produce the next generation of AI breakthroughs (individuals and small teams, not just large labs).
"Thanks to LLM coding, what took a whole team of research scientists at DeepMind and, you know, millions of dollars of research and compute can now be done for, you know, a few thousand dollars of rented compute." 01:59:59
This pairs with Eric's sabbatical project itself as a proof point — a single researcher reproducing and extending AlphaGo alone, aided by Claude for implementation. The compounding effect of AI-assisted research on small, high-talent teams is likely to be one of the most important structural shifts in how scientific and technical progress happens over the next decade.
Macroscopic Predictability as a General Principle for Hard Problems
Almost as an aside in a philosophical tangent, Eric articulates a principle that is deeply non-obvious: even in systems that are chaotic at the micro level (exact board states in Go, exact storm trajectories in weather), macroscopic quantities (who will win, where the hurricane will go) can remain predictable and learnable. This is the actual reason neural networks can solve "hard" problems — and it implies that many domains currently considered intractable may have learnable macroscopic structure waiting to be discovered.
"We don't necessarily care about the micro scale things, we actually care about the macroscopic structure and these things can be predictable." 01:22:23
"In weather it could be the same thing right like we don't exactly care what the velocity of wind 6,000 feet above a specific latitude longitude is, we kind of care like where's the hurricane or things like that... you don't know where you're going to end up but you do know that the thing looks like this." 01:21:53
This reframes the question for investors and researchers alike: the opportunity is not in solving micro-level complexity, but in identifying domains where a macroscopic signal exists and applying neural networks to learn it. Areas like drug discovery, materials science, economic forecasting, and climate modeling may all have untapped macroscopic predictability that has not yet been exploited.