Why Hardware-Software Co-Design Is AI's Real 100x: Dylan Patel of SemiAnalysis
- 01 Hardware-Software Co-Design Is the Real Multiplier, Not Individual Layer Improvements
- 02 Model-Hardware Architecture Is Already Diverging, Making Chip Choices Lock-In Decisions
- 03 The "CUDA Moat" Is Real but Misunderstood
- 04 Inference Benchmarking Must Be Continuous and Pareto-Optimal, Not Point-in-Time
- 05 AI Model Cost Has Fallen ~60x Per Year on Equivalent Benchmark Quality
- 06 Jensen Huang Is Deliberately Engineering a Multipolar AI World to Prevent Hyperscaler Dominance
1. Key Themes
Hardware-Software Co-Design Is the Real Multiplier, Not Individual Layer Improvements
The conventional framing of AI progress as coming from hardware, software, or model improvements separately misses the point. The biggest gains come when all three layers are co-optimized together, producing non-linear returns.
"If you look at the architecture of any of these models, but DeepSeek is the most famous one... the shapes of all the experts in DeepSeek V3 were all optimized for Hopper. And if you look at for V4, they're optimized for Blackwell and Huawei's chip... I think a lot of this co-optimization is the most important thing. It's called software hardware co-design. And that's what's really exciting... you take what could have been a 2X here, 2X here, 2X here. And instead of being multiplicative to 8X, it's actually 100X because you've optimized across all three layers." — Dylan Patel 00:28:25
Model-Hardware Architecture Is Already Diverging, Making Chip Choices Lock-In Decisions
OpenAI and Anthropic are settling on fundamentally different model architectures, so their hardware choices are diverging too: running one lab's model on the other's preferred chip yields meaningfully worse performance.
"The way OpenAI's models are headed, it would be a terrible decision for them to use TPUs potentially. And the way that Anthropic and Google's models are headed, it's actually a terrible decision potentially for them to train with GPUs... OpenAI's are much more sparse and that has benefits. And then Anthropic's are, you know, they're still sparse, but more dense in general." — Dylan Patel 00:34:36
The "CUDA Moat" Is Real but Misunderstood — It's Actually an Ecosystem Shape Problem
The moat is not about CUDA the language. The leading open-source models, including those from DeepSeek, Kimi, Zhipu AI, Alibaba, Tencent, and Xiaomi, are co-designed for NVIDIA GPU shapes, so they run poorly on other hardware: a structural advantage that has nothing to do with programming difficulty.
"What people call the CUDA moat is not actually anything to do with CUDA. But it's like the fact that DeepSeek, Kimi, Zhipu AI, Alibaba, Tencent, Xiaomi — their models are a co-design for GPUs. And therefore, if I want to run them on TPUs, actually, in some cases they don't run really well on TPUs... It's that the downstream product is more optimized for NVIDIA." — Dylan Patel 00:36:34
Inference Benchmarking Must Be Continuous and Pareto-Optimal, Not Point-in-Time
The practice of comparing a competitor's suboptimal configuration against one's own optimal configuration is pervasive and misleading. Dylan treats the throughput-versus-interactivity curve as the single most important curve in AI infrastructure.
"A lot of times when people are comparing inference performance, they're taking a suboptimal curve or point for someone else and comparing it to their optimal one. It's like, well, yeah, I can make — if I drove a Porsche versus like some race car driver, obviously I'd drive it slower... Most things in hardware, infrastructure, model, application layer, everything is downstream of that curve." — Dylan Patel 00:17:33
AI Model Cost Has Fallen ~60x Per Year on Equivalent Benchmark Quality
This is a staggeringly fast deflationary curve driven simultaneously by hardware, software, and model improvements — and it is accelerating, not decelerating.
"We've seen model costs drop for equivalent quality by like 60x a year. It's incredible... It's been a 60X cost decrease for same benchmark level. We've also seen the same on intelligence per watt. It's not been exactly 60X. It's been closer to like 40X." — Dylan Patel 00:16:08
Jensen Huang Is Deliberately Engineering a Multipolar AI World to Prevent Hyperscaler Dominance
NVIDIA's aggressive investment in and promotion of NeoLabs and NeoClouds is a calculated strategic move, not charity: a world where only hyperscalers build compute and only a few closed labs ship models is an existential threat to NVIDIA's business.
"Jensen absolutely hates a world where all the hyperscalers have all the power. There's a reason he's blowing money on random AI labs... he wants to create a multipolar world. That's why he loves Chinese labs... A world where OpenAI and Anthropic and Google models are the only models is one in which he's screwed. A world in which the hyperscalers are the only ones building compute is one he's screwed." — Dylan Patel 00:07:07
Anthropic Is Already Profitable and Approaching GAAP Profitability
This is not widely understood in the market. The per-token gross margin on flagship models is extremely high, allowing Anthropic to pay above-market rates for compute and still expand margins.
"Anthropic in Q2 is profitable. They're net income profitable, excluding stock-based compensation. And I think by Q3 they may even be profitable, including stock-based compensation. That's like how profitable they're getting and their margins on an Opus token, at least Opus 4.8 token, is like north of 80% for the API price." — Dylan Patel 00:52:31
Space Will Capture the Majority of Incremental Compute Buildout by 2040
While dismissed as near-term, the long-arc thesis for orbital compute is strong — driven by terrestrial power constraints and the sheer scale of compute demand forecast.
"By 2030, just OpenAI and Anthropic will have over 100 gigawatts combined... By 2040, it'll be terawatts... If you look at 2040, I think probably more than half of incremental compute will be going in space. But if you look at 2030, I think it's sub 1%." — Dylan Patel 00:21:27
NeoCloud Outperformance Over Hyperscalers Is Structural, Not Accidental
GPU compute at NeoClouds like CoreWeave objectively outperforms hyperscalers on AI workloads because the traditional cloud optimizations — Nitro NICs, tenant isolation, hypervisor overhead — actively hurt GPU performance.
"In the AI cloud, a lot of this stuff hurt performance, right? These Nitro NICs were bad for performance... No one rents a single GPU in a 72GPU rack. They rent the whole rack and in fact they rent many of the racks. And then there's no like, oh, I rent for six hours and I give it back. So the mechanics of the GPU rental market meant that a lot of the expertise of the hyperscalers fell away." — Dylan Patel 00:05:26
The Demand for AI Tasks Is Expanding Faster Than Compute Capacity
The TAM for any given model is not fixed — it expands dramatically with each capability jump, meaning a step-function model improvement creates more new demand than the period's compute increase can satisfy.
"The TAM for Mythos 5, Fable 5 is not just like 2x that of Opus, right? The model is so much better and it can do so many more tasks that the TAM for it is way larger... The world's compute did not double in that or quadruple in that same timeframe. But the demand for useful tasks that can be done by AI... has." — Dylan Patel 00:51:34
2. Contrarian Perspectives
The Model Layer Has Driven More Efficiency Gains Than Hardware Over the Last Three Years
The conventional narrative credits hardware generations (e.g., Hopper to Blackwell) as the primary driver of AI efficiency. Dylan disagrees sharply.
"I completely disagree with you, by the way... From Hopper to Blackwell, which is all we've had over the last three years, roughly 30x improvement on DeepSeek on the most optimized deployment. But over the last three years, we've had way more improvement intelligence per watt. A lot of that coming from the model layer, right? If you look back three years, it's GPT-4. Now it's like one of the smaller QwQ models that's like 27B parameters total and like 2 billion active is like way better." — Dylan Patel 00:24:45
China Didn't Invent Co-Design — Western Labs Just Don't Tell You What They Do
The popular take after DeepSeek's release was that China had discovered hardware-software co-design as a technique. Dylan says Western labs were already doing this; they simply don't publish it.
"I don't necessarily think so. I think it's more so that the West doesn't tell people what they do. Like, OpenAI didn't tell people that GPT-4o was how sparse it was, what the shape size was, all these things. But GPT-4o is roughly the same size, slightly smaller than DeepSeek V3. And 4.0 came out a little bit earlier, right?" — Dylan Patel 00:26:27
High Leverage + High Growth Is Not a Warning Sign — It's a Feature for Equity Holders
When Shaun expresses concern about the highly leveraged buildout in the AI infrastructure ecosystem, Dylan reframes it as exactly the condition that creates outsized equity returns.
"Wait, hold on. High leverage, high growth means small amount of equity has huge upside. Let's see. You're not a debt investor... You got to go to the school of private equity." — Dylan Patel 00:54:08
Models Are Improving Faster Now Than Six Months or a Year Ago — Not Plateauing
Against widespread pundit commentary about model improvement curves flattening, Dylan argues the feedback loop between models and engineering is creating a pseudo-recursive acceleration.
"Models are improving faster than they were six months ago or a year ago because the models are helping write all the info and launch the next model sooner and sooner and sooner. So you've got this like pseudo recursive self-improvement loop going." — Dylan Patel 00:55:23
Power Constraints Are More Solvable Than Commonly Presented
The energy bottleneck narrative is treated as a hard ceiling. Dylan offers a concrete, non-obvious workaround using existing industrial supply chains.
"Take the millions of diesel engines for trucks that the US has the capacity to make. You can very trivially convert them to be using gas in the assembly line and then stick them up to an electrical motor back driving it. So the electrical motor generates electricity rather than the electrical motor causing the rotation of the wheel... You've generated electricity by pumping gas into something the US can make millions of... You can just pull people out of car mechanic shops and have them run around and repair truck engines." — Dylan Patel 00:31:57
3. Companies Identified
SemiAnalysis
Research and analysis firm covering the semiconductor and AI infrastructure supply chain, founded and run by Dylan Patel. Reportedly passed $100M in revenue with 90 employees, a mix of deep technologists and former hedge fund analysts. Runs the InferenceX benchmarking platform and has over $50M (soon $100M+) in hardware donated by cloud providers and chip makers.
"There are rumors that semi-analysis recently passed a hundred million of revenue... We have 90 people and a big chunk of them are technologists, engineers across the whole supply chain. And then a big chunk is people who are formerly at hedge funds." — Sonya Huang / Dylan Patel 00:01:03
InferenceX
SemiAnalysis's continuously updated, automated, open-source inference benchmarking platform. Runs daily on $50M+ of hardware donated by cloud providers, spanning about 15 chip types; benchmarks the latest models and publishes Pareto-optimal inference configurations for free.
"We got CoreWeave and Crusoe and Nebius and Oracle and Microsoft and Amazon and Google and OpenAI to contribute to us compute... We've got over $50 million of hardware donated to us. Once we launched TPUs and Tranium, it would actually be over $100 million of hardware... Maybe about 15 different chip types, all running these benchmarks every single day on all the latest models." — Dylan Patel 00:16:08
NVIDIA
Dominant AI compute company. Praised for co-optimizing from the model layer all the way down to silicon; strategically investing in NeoClouds and NeoLabs to ensure a multipolar AI world that prevents hyperscaler dominance.
"You look at a company like NVIDIA, who's not co-optimizing on the model layer per se, but a little bit from the model layer all the way downstream to silicon." — Dylan Patel 00:28:49
TSMC
World's leading semiconductor foundry. Called out for co-optimizing not just fabrication but the entire upstream chain including components, consumables, and tools.
"You look at a company like TSMC, they're co-optimizing not just fabrication, but all the way from the components and the consumables and the tools all the way upstream to what the designs — their chips or the customers are telling them." — Dylan Patel 00:28:49
Anthropic
AI lab. Highlighted for already being profitable, with per-token gross margins north of 80% on flagship models at API prices, for renting compute above market rate while staying profitable, and for having co-optimized heavily with AWS Trainium.
"Anthropic in Q2 is profitable. They're net income profitable, excluding stock-based compensation... Their margins on an Opus token, at least Opus 4.8 token, is like north of 80% for the API price." — Dylan Patel 00:52:31
CoreWeave
NeoCloud GPU compute provider. Praised for objectively better AI compute performance versus hyperscalers; noted for strong team and ability to deliver reliable, fast-to-market compute, though challenged by balance sheet constraints vs. hyperscalers.
"When CoreWeave builds a gigawatt, even though their GPU compute is objectively better than Amazon or Google or Microsoft's in terms of performance — we've tested the performance and reliability — the problem is Google sells it six months before they have it up." — Dylan Patel 00:36:34
Crusoe
NeoCloud GPU compute provider. Praised as a phenomenal team — originally crypto/flare gas background — that has emerged as a legitimate compute provider.
"Oddly Crusoe, who's a bunch of crypto guys who then started building data centers and doing flare gas stuff... I gotta say both those teams are phenomenal." — Dylan Patel 01:08:08
SGLang / vLLM
Open-source inference serving frameworks. Described as leading open-source efforts and core collaborators in the InferenceX benchmarking project.
"We were able to work with SGLang and vLLM and now Radix, Arc, and Infraact, which are the private companies who are sort of leading those efforts, the open source efforts, to collaborate with us." — Dylan Patel 00:16:36
Google / DeepMind
Highlighted for running three parallel, architecturally distinct TPU design programs simultaneously with different chip partners (Broadcom, MediaTek, and a third undisclosed), and for sophisticated power management enabling them to sell more gigawatts than their contracted power.
"Google actually has three different design programs for TPUs. They're making a TPU with Broadcom. That's a different architecture than the TPU with MediaTek. That's a different TPU than the architecture that is — I won't disclose — but they're making different architectures." — Dylan Patel 00:49:38
SpaceX / Starlink
Noted as a major AI compute provider (xAI GPU cluster), praised for delivering compute that is already live and billable vs. forward-sold capacity from NeoClouds. Sequoia is described as a "very large" investor. Also highlighted for networking and power management expertise from Starlink and Tesla respectively.
"SpaceX was like, no, no, no, this is running now. Buy it, right? And it's a big discrepancy when you have a balance sheet to do that versus not." — Dylan Patel 00:36:34
Thinking Machines (Tinker)
A NeoLab called out as a surprise commercial success story despite media coverage of talent departures, with "a few hundred million dollars of ARR" from a product less than six months old.
"Thinking Machines has a few hundred million dollars of ARR. That's pretty impressive, even though they've had, you know, in the media, it's like, oh, they've lost all this talent. It's like, well, but Tinker is doing a few hundred million dollars of ARR. That's pretty impressive for out of the gate, a product that's less than six months old." — Dylan Patel 00:09:12
Cerebras
Mentioned as a genuinely innovative company with very fast inference, which Dylan sees as a big market; he notes that SemiAnalysis itself uses "fast mode" almost exclusively. Risk flagged around inability to run very large or long-context models at scale.
"I think Cerebras is a really innovative company. I think in some spots of the market, they're really, really good. Very fast inference. I think that's a big market. We use fast mode almost exclusively at SemiAnalysis." — Dylan Patel 00:38:47
Groq
Mentioned alongside Cerebras as doing "weird shit" on chip architecture — a compliment. Also flagged with similar large-model/long-context constraints.
"Running really large models at really long context is very difficult on SRAM-based chips like Cerebras, like Groq." — Dylan Patel 00:40:03
DeepSeek
Chinese AI lab. Used as the canonical public example of hardware-software co-design done well — V3 optimized for Hopper, V4 optimized for Blackwell and Huawei's chip. Runs poorly on TPUs precisely because of this co-design.
"If you look at the shapes of all the experts in DeepSeek V3, they were all optimized for Hopper. And if you look at for V4, they're optimized for Blackwell and Huawei's chip... TPUs suck at running DeepSeek." — Dylan Patel 00:25:18
AWS Trainium
Amazon's custom AI chip. Noted as renting to Anthropic and OpenAI at below $10B per gigawatt per year, lower than GPU rates, and described as legitimately excellent hardware that Anthropic helped make useful by writing the software stack.
"Trainium sells at sub $10 billion per gigawatt rental rate to Anthropic and to OpenAI... Everything I hear is that Trainium's really freaking good hardware and it's getting way, way, like way better." — Dylan Patel / Sonya Huang 00:58:18
Naveen Rao's Company (MosaicML follow-on)
Not named explicitly, but invested in by Sequoia. Described as pursuing a long-horizon bet on co-designing analog compute, energy-based models, and silicon simultaneously: something that "definitely won't work quickly" but is genuinely exciting at the frontier.
"He's trying to innovate on the Silicon layer on the software abstraction layer and the model layer simultaneously... We're going to bring like potentially like analog compute with energy-based models and like all this crazy stuff all at once. That's exciting." — Dylan Patel 00:45:12
ASML
Referenced in the context of SPIE advanced lithography conferences as one of the few English-speaking attendees at deep-niche chemical/supply-chain conferences — indicating their deep involvement at the furthest upstream layers of the stack. 00:11:34
4. People Identified
Dylan Patel
Founder and CEO of SemiAnalysis. Built the premier semiconductor and AI infrastructure research firm from scratch — starting as an anonymous internet poster — to reportedly $100M+ revenue with 90 employees. Former quant. Grandmaster-level StarCraft player. Attends 40+ global conferences per year. Runs InferenceX. Unique for bridging deep technical knowledge with supply-chain economics.
"It's pretty insane what you've done. Semis five years ago were not very sexy in the West... You created probably the premier research company in the space. It's been educating the world and the state of the art from very technical details to supply chain to the bigger picture." — Sonya Huang 00:01:03
Jensen Huang (NVIDIA CEO)
Called out for playing deliberate long-term strategic chess by seeding NeoClouds and NeoLabs globally to prevent hyperscaler and closed-lab concentration — protecting NVIDIA's long-term revenue base.
"Jensen absolutely hates a world where all the hyperscalers have all the power. There's a reason he's blowing money on random AI labs... he wants to create a multipolar world." — Dylan Patel 00:07:07
Naveen Rao
Former Intel AI lead, founder of MosaicML (acquired by Databricks), now running a new venture backed by Sequoia. Praised for being ahead of his time, mentoring younger talent, and pursuing a genuinely long-horizon hardware-software-model co-design vision.
"I think he's one of the first people I met in the industry... I baited him on the internet and he started replying... He's always trying to help the younger generation. He's trying to identify talent." — Dylan Patel / Sonya Huang 00:45:56
Larry Page
Called out for one of the greatest private investments of all time — investing $1 billion in SpaceX at a $10 billion valuation, acquiring roughly 10% which has since compounded enormously.
"Larry Page invested a billion dollars at a $10 billion valuation, got 10% of the company. It got diluted, like all this. But that was one of the greatest investments of all time. Good job, Larry." — Sonya Huang 00:56:00
Brett Mayo (name as transcribed; mentioned in the SpaceX/xAI compute context)
Mentioned specifically in the context of SpaceX's compute operation as an example of people that others underestimate for their networking and power management expertise.
"People like Brett Mayo are incredible. For me, that's actually probably the thing that might be missing from the analysis a lot of people are doing." — Sonya Huang 00:03:31
Chase (Crusoe)
Mentioned by first name as one of the Crusoe team members who are "hyper levered equity owners" and thus intensely motivated to deliver compute faster than large organizations.
"You look at Crusoe, for example, Chase, and all the other people at the team... all these people are getting rich if they fucking deliver this compute faster. They're levered, they're hyper levered equity owners." — Dylan Patel 00:06:20
5. Operating Insights
Track Individual Token Spend Daily and Interrogate Spikes
SemiAnalysis runs a daily internal monitoring system for AI token consumption by individual employee, treating anomalous spikes as triggers for a brief ROI conversation rather than criticism — normalizing cost accountability without killing usage.
"We do it pretty diligently... we also track everyone's token spend by day. And if someone's spiked up, I'm like, what did you do? It's like, okay, thank you for telling me that that seems worth it. Cool. On with my day." — Dylan Patel 00:39:13
Build a Team That Fuses Domain Engineers With Former Investors and Let Them Fight It Out
The SemiAnalysis model of embedding ex-hedge fund analysts alongside deep technical engineers — and allowing informal, friction-filled debate — is what produces conclusions that are both technically correct and economically grounded. Few organizations deliberately design for this tension.
"We have 90 people and a big chunk of them are technologists, engineers across the whole supply chain. And then a big chunk is people who are formerly at hedge funds. And you see these arguments like people are like, oh, that doesn't matter. And then someone's like, well, but cost. And then the engineer's like, no, no, no, but this technology is the coolest. And you see this organically like fight it out." — Dylan Patel 00:41:33
Go to the Arcane Niche Conference Four Times Before Trusting Your Own Judgment
Dylan's rule of thumb for deep-supply-chain knowledge: attending a highly specialized conference (e.g., SPIE Advanced Lithography) even three or four times still leaves meaningful gaps. This is a forcing function for humility and continued learning investment.
"I went to them the first time. I didn't even understand 90% of what I heard... The next time I went, I understood like half... Third time I went, I understood like 75%... Even now I went and I was like, I still don't understand everything." — Dylan Patel 00:12:04
Separate the Throughput-Latency Curve Into Distinct Pricing Tiers to Capture Different Willingness to Pay
Claude Code "fast mode" and OpenAI's priority queue are early examples of a structural shift: the same underlying model can command radically different prices depending on how time-sensitive the token delivery is. Operators building on top of AI should architect their cost structures around this curve rather than treating inference as a single cost.
"The way we treat AI infrastructure is it's like one size fits all. But over time, we're going to get to the point where there's stuff where you have batch workloads or you need instant response. And there's the whole curve that's going to matter for users. We see this with Anthropic, right? Claude Code fast mode costs way more than regular mode." — Dylan Patel 00:18:45
6. Overlooked Insights
Google Is Quietly Running Three Architecturally Distinct TPU Design Programs Simultaneously
This was mentioned in a single, dense passage and received no follow-up from the hosts. It is a major strategic signal: Google is not simply building one chip design with several vendors. It is hedging across fundamentally different chip architectures: one with Broadcom, one with MediaTek, and a third that Dylan declined to disclose. This suggests Google does not yet consider the optimal AI chip architecture settled, even internally, and is running a portfolio of architecture bets. For investors, this implies the "TPU vs GPU" framing is far too simple, and that even within Google's own stack, the winning architecture is genuinely uncertain.
"Google actually has three different design programs for TPUs. They're making a TPU with Broadcom. That's a different architecture than the TPU with MediaTek. That's a different TPU than the architecture that is — I won't disclose — but they're making different architectures. It's not just like, oh, they're making TPUs with a couple vendors and it's the same architecture. It's different architectures. And the third one is a very different architecture from the first two." — Dylan Patel 00:49:38
Stacking Memory Directly on the Compute Die — Not Just Adjacent — Is an Imminent Breakthrough That Could Redefine Bandwidth Economics
This was mentioned in passing during a discussion of memory bottlenecks and received no elaboration. Today's HBM sits beside the logic die, not on top of it. Stacking memory directly on the compute die would cause bandwidth to "explode": a step-function improvement, not an incremental one. Combined with the observation that the DRAM cell itself has seen no fundamental innovation in 40 years, this is a rare case where a known-but-undated breakthrough could suddenly unlock a new architectural generation. Dylan says companies are already working on this, including early proofs of concept.
"There's new innovations coming in the next few years where instead of stacking the HBM separately from the chip, you stack the memory directly on the chip and that makes your bandwidth explode. And so there's interesting companies in that space and interesting POCs that companies are trying to do there." — Dylan Patel 00:30:11