Manchi | 晚点聊 LateTalk Summary

Podcast: 晚点聊 LateTalk | Episode: 163 | Participants: Manchi (host), Liu Yifeng (UCLA PhD, model architecture), Zhao Chengyang (SGLang core team / RedixArc founder, Infra)

1. Key Themes

The "Combinatorial Explosion" Engineering Bet

DeepSeek V4 introduced four tightly coupled innovations at once: mixed attention (SWA + CSA/HCA), MHC residuals, the Muon optimizer, and FP4 training. Each would demand enormous engineering effort on its own, yet all four shipped together on a 1.6T-parameter model.

"DeepSeek introduced four mutually coupled new features at once: mixed sliding window attention, MHC, Muon as a new optimizer, and FP4 training. Any single one of these going live requires an enormous amount of debugging. Four at once — this is a combinatorial explosion problem." — Zhao Chengyang [00:09:50]

Million-Token Context Crossing from Theory to Practical Reality

V4's architectural choices (token-wise compression, sparse attention, FP4) collectively brought long-context inference from theoretically possible to economically viable. The efficiency gains are context-length dependent: the longer the context, the more dramatic the improvement.

"These four things together make the previous million-context window go from theoretically feasible to cost-acceptable." — Zhao Chengyang [00:28:53]

"The context length has to be long enough. If your actual test scenario is only a few thousand tokens, the Flops savings of V4 relative to V3.2 won't be this extreme." — Zhao Chengyang [00:24:59]

The MoE Activation Ratio Arms Race — and Its Difficulty Ceiling

V4 pushed the ratio of activated to total parameters down to approximately 3% (49B active out of 1.6T total), the lowest among frontier open-source models. This ratio is harder to push down than it looks: the difficulty rises exponentially, not linearly.

"The difficulty is rising exponentially with that number. Going from 5% to 4% is one thing. From 4% to 3% is far, far harder... I am in awe that they could continue pushing toward the extreme." — Zhao Chengyang [00:30:50]

"The ratio of total to active parameters in V3 was about 18:1. Now in V4 it's close to 40:1. This places extremely high demands on algorithm and low-level operator development." — Liu Yifeng [00:31:19]

2. Contrarian Perspectives

DeepSeek Not Disclosing Training Cost Is a Signal of Maturity, Not Concealment

V3 famously disclosed $5.57M training cost. V4 disclosed nothing. Most observers speculate they're hiding a higher cost. The panelists argue the opposite: it signals DeepSeek no longer needs a cost narrative to define itself.

"I think choosing not to proactively disclose it is itself a signal. They are no longer a team that needs to define themselves through cost narratives. Instead, they want to let the model's capabilities speak for them." — Zhao Chengyang [00:08:21]

Frontier Models Are Already Indistinguishable for Most Use Cases

One panelist switched from Claude to Codex for a day due to billing issues and found essentially no difference — suggesting the differentiation between top models is overstated in public discourse.

"My conclusion was: without Claude, the world did not stop raining. I strongly believe these models are already very hard to differentiate in my use cases." — Zhao Chengyang [00:21:05]

Token Efficiency Is Getting Worse, Not Better, Despite Architectural Improvements

Despite V4's dramatic reduction in per-token FLOPs, users report that it consumes more tokens to solve the same problem. This creates a troubling feedback loop in which training incentives reward task completion, not efficiency.

"There is a kind of aesthetic beauty in the waste — like using a high-pressure fire hose to water flowers. The model faithfully reflects the training it received. In the training data, part of it must involve solving the same problem with longer outputs, creating a very bad loop." — Zhao Chengyang [00:27:25]

Linear Attention Has a Hard Performance Ceiling; Sparse/Sliding Window Will Win

Despite much academic excitement about linear attention (e.g., Gated DeltaNet), its information compression at long distances fundamentally limits its upper bound for reasoning-heavy tasks.

"For models pushing for maximum capability, they will definitely lean toward sparse and sliding window attention. Linear attention, at each token step, continuously compresses information — so for long-horizon reasoning tasks like math derivation, its ceiling is lower." — Liu Yifeng [00:36:30]

The Real Bottleneck Isn't Context Length — It's Wasted Context

The more important unsolved problem is not expanding to 1M tokens but using existing context budget far more efficiently — a challenge the agent community hasn't seriously internalized yet.

"I think the agent community needs to pivot its thinking. How do we use context more efficiently? People have been spoiled by Infra that supports 1M tokens. But we could be doing far more within 1M tokens." — Zhao Chengyang [00:16:12]

3. Companies Identified

SGLang / RedixArc: Open-source inference framework deployed on 400,000+ GPUs globally, with RedixArc as its commercial arm. Mentioned because the team ran both the inference and RL pipelines for DeepSeek V4 on launch day, a feat that previously took open-source communities months.

"Our team did substantial engineering optimization and successfully ran both the inference and RL pipelines on the day DeepSeek V4 was released." — Zhao Chengyang [00:02:29]

Kimi (Moonshot AI): Chinese frontier AI lab. Praised for its Moonlight improvement to the Muon optimizer (pinning the Muon learning-rate ratio at about 0.2), which unlocked Muon for practical large-scale use. Also noted for its Attention Residual architecture innovation (DenseNet-style cross-layer connections).

"Moonlight's important contribution was nailing down the ratio coefficient — approximately 0.2 — making it so you only need to tune one learning rate for the whole model. That's when Muon went from theoretical innovation to real large-scale application." — Liu Yifeng [00:42:45]

Anthropic (Claude): Praised specifically for Claude Code's agentic coding capability, described as making a generational leap from Claude 4.5 onward, backed by very strong internal RLHF/RLAIF data flywheels.

"From Claude 4.5 onward, the multi-step agentic coding performance compared to before improved enormously. You can imagine that RLHF or RLAIF, after years of accumulation plus massive high-quality human feedback data, has formed a very powerful data flywheel in the US." — Zhao Chengyang [00:26:35]

TileLang (open-source project, Peking University): A domain-specific language for writing GPU kernels, positioned between Triton and raw CUDA. Now adopted by frontier AI labs globally for rapid development of high-performance kernels.

"TileLang's long-term value is in dramatically reducing the engineering cost of rapidly developing high-performance kernels for new algorithms. TileLang has now been adopted by frontier labs as one of the default choices for algorithm implementation." — Zhao Chengyang [00:57:41]

ElevenLabs: Voice AI company. Briefly cited as an example of a company thriving in a niche (audio/voice) that is somewhat insulated from the hypercompetitive LLM race and commercially healthy in its own right.

"ElevenLabs seems to be existing in a relatively self-contained space." — Manchi [01:23:39]

4. People Identified

Zhao Chengyang (赵承阳): Co-founder of RedixArc and core contributor to SGLang. Former RL systems engineer who witnessed R1's launch impact firsthand. Described as having deep Infra expertise in frontier model deployment.

"I've personally witnessed DeepSeek R1 bring enormous influence to the LLM domain. In some sense it gave this field unprecedented attention." — Zhao Chengyang [00:02:00]

Liu Yifeng (刘一峰): UCLA PhD student in ML, specializing in LLM pre-training, optimizers, and model architecture. Previously worked on foundation model development at Yuanai and ByteDance.

"I'm developing new large model training algorithms, and also participating in AI-related project development using current industrial-grade models." — Liu Yifeng [00:01:03]

Keller Jordan: Independent developer who proposed the Muon optimizer. Recruited by OpenAI in December 2024 on the strength of this work.

"The optimizer's developer, Keller Jordan, was recruited by OpenAI in December 2024 based on this achievement. He was originally an individual developer." — Manchi [00:39:54]

Yang Zhiling (杨植麟) / Kimi team: Kimi's leadership has repeatedly highlighted the Moonlight optimizer work at GTC and other venues. That work became the practical bridge that made Muon usable at scale.

"At GTC and multiple occasions, Yang Zhiling kept discussing Kimi's optimization of Muon — the version called Moonlight." — Manchi [00:40:22]

Yao Shunyu (姚舜宇): Described as a senior schoolmate (学长) of both podcast guests. Recently returned to lead Hunyuan (Tencent), which has launched a ~300B-parameter model. Flagged as someone to watch, especially given potential WeChat integration.

"Yao Shunyu, our senior, has returned to lead Hunyuan. The 300B-scale model they've released is very solid. If 3.0 gets into WeChat, the competitive landscape could get very interesting." — Zhao Chengyang [01:25:06]

5. Operating Insights

Use Optimizer Adoption as a Proxy for Infra Team Quality

Whether a model team has successfully migrated to the Muon optimizer (from AdamW) is a surprisingly reliable signal of overall engineering depth — because it requires solving distributed training, complex state management, and post-training consistency problems that most teams haven't tackled.

"You can use Muon optimization as a good litmus test for assessing a team's engineering optimization capability. I strongly agree with that." — Zhao Chengyang [00:48:30]

"Pre-training and post-training optimizers basically have to stay consistent. If the post-training side hasn't migrated to Muon yet, pre-training likely has to stay on AdamW too — because post-training Infra is even harder to change." — Liu Yifeng [00:44:41]

For Agent Builders: V4's Architecture Advantage Only Manifests at Long Context

Engineering teams building agents on V4 should design specifically for long-context workloads to capture the architectural benefits. Short-context agent calls (a few thousand tokens) will see little of V4's efficiency improvement.

"Quick takeaway: the longer your context, the more significant the efficiency advantage. If you're only using a few K tokens, there's no very obvious improvement." — Zhao Chengyang [00:25:28]

Separate Your Evaluation Framework from Your Benchmark Scores

Benchmark scores saturate within 6-12 months. Building durable evaluation capability — particularly for agentic, multi-turn, and tool-use scenarios — is the real competitive moat. Teams that can't measure improvement will optimize blindly.

"We cannot optimize what we cannot evaluate. If we don't have a score for the capability we want to improve, we have no idea if our optimization is right... If we don't do evaluation well, this industry will fall into a self-deceiving vicious cycle." — Zhao Chengyang [01:13:48]

6. Overlooked Insights

DeepSeek Validated Domestic Chinese Chip Inference — A Brief But Significant Mention

In a single sentence in the technical report, DeepSeek mentions doing technical validation of their inference pipeline on Huawei's Ascend chips. This was noted by the guests and then quickly dropped — but the implication is significant: DeepSeek may be quietly building hardware independence from NVIDIA, which would have enormous geopolitical and supply chain implications.

"In the section on infrastructure and the EP parallelism scheme, they mentioned: 'We did technical validation on Huawei's Ascend.' That's what you mean by native support for domestic chips — but this is for inference. Whether training used them... we don't know." — Zhao Chengyang [00:11:18]

Video Generation Models Are Anomalously Profitable — Evidenced by Nobody Open-Sourcing Them

In an otherwise LLM-focused conversation, it was briefly noted that no credible frontier video generation model has been open-sourced (the only exception being Alibaba's Wanxiang). The panelists interpret this as strong evidence that video generation is genuinely profitable — unlike LLMs where competitive dynamics push toward open-sourcing. This makes video generation a potentially underappreciated investment theme.

"I notice no one seems willing to open-source video generation models. That might itself be evidence that they're actually quite profitable... Voice model engineering optimization is far behind language model optimization — a lot of what's been done for LLMs could be re-implemented for voice models." — Liu Yifeng / Zhao Chengyang [01:22:10]