138. A 3.5-Hour Interview with Luo Fuli: The AI Paradigm Has Already Shifted Dramatically! OpenClaw, Ag…

Participants: Zhang Xiaojun (Host), Luo Fuli (AI Researcher, Head of Xiaomi Large Model Team, formerly Alibaba DAMO Academy and DeepSeek)

1. Key Themes

The Agent Era Has Arrived and OpenClaw Is Its Landmark

Luo Fuli describes a paradigm shift from a pre-training-dominated chatbot era to a post-training-dominated agent era, with OpenClaw (an open-source Claude Code alternative) as the defining inflection point.

"I myself actually regard OpenClaw as an epoch-making agentic framework — that's how I define it." 00:02:27

She recounts staying up from 2 a.m. to 6 a.m. on her first night using it; by day three it was helping her design research architectures:

"From thinking it was just a product with soul and warmth, to it helping replace parts of my life and work, to finally helping accelerate my research — all of that happened in three days. Every day it gave me additional surprises." 00:07:47

Post-Training Is Now Equal to Pre-Training in Importance

A massive structural shift is underway in how compute should be allocated across the model development lifecycle. This has direct implications for who wins the next phase of the AI race.

"I think a very reasonable GPU allocation ratio is perhaps 3:1:1 — for research, pre-training, and post-training. The compute invested in pre-training and post-training should be roughly equivalent. And research should have even more GPUs than the total training GPUs combined." 00:01:30

"In the ChatGPT era, the ratio was something like 3:1 or 5:1 favoring pre-training. This year, it's probably already 1:1 at top teams." 01:48:40

Open-Source Agent Frameworks as Collective Intelligence Accelerators

The open-source nature of OpenClaw is not merely a product strategy — it enables group intelligence to improve the agent framework itself at a speed no single company can match.

"What truly moved me was when everyone started together modifying the framework itself. Because when you see someone else using OpenClaw to accomplish something, it sparks your own imagination. Individual imagination is really limited, but when you see what others can do, it multiplies." 00:26:30

"I'm now very pleased to see OpenClaw's star count flying up. I think this is something that absolutely must happen before AGI arrives." 00:31:28

2. Contrarian Perspectives

OpenClaw Is Not Just a Better UI — It's a Framework That Compensates for Model Weaknesses

Most people dismissed OpenClaw as "just Claude Code with an IM interface." Luo Fuli argues it is fundamentally different because its entire design philosophy is to compensate for model shortcomings through agent orchestration.
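
As a rough illustration of what "compensating through orchestration" could mean in practice, the sketch below routes a request to the first model that covers the input types it needs, so the user never picks a model by hand. This is not OpenClaw's actual implementation: the registry, the model names, and the `route` function are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelSpec:
    name: str
    modalities: frozenset  # input types this model is assumed to handle well


# Hypothetical registry, ordered by preference (cheapest first).
# These names are placeholders, not real OpenClaw configuration.
REGISTRY = [
    ModelSpec("text-model", frozenset({"text"})),
    ModelSpec("vision-model", frozenset({"text", "image"})),
    ModelSpec("video-model", frozenset({"text", "image", "video"})),
]


def route(required, registry=REGISTRY):
    """Return the first registered model that covers every required modality."""
    needed = frozenset(required)
    for spec in registry:
        if needed <= spec.modalities:
            return spec.name
    raise LookupError(f"no model handles {sorted(needed)}")


print(route({"text"}))           # text-model
print(route({"text", "video"}))  # video-model
```

The design point is that the fallback logic lives in the framework, not in the user's configuration: a weak default model is covered by whichever registered model can take the input.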

"When I used Quark Code, if the model's video understanding capability wasn't good, I'd have to configure a better video understanding model myself. But with OpenClaw, I don't need to think about this at all — I just send it a video and it figures out on its own which model to use." 00:08:46

"I believe OpenClaw's core product logic from the beginning was: try as much as possible to compensate for model weaknesses through the entire agent orchestration system." 00:09:44

Benchmarks Are Now Largely Irrelevant for Agent-Era Model Development

In a field obsessed with leaderboard rankings, Luo Fuli argues that during paradigm shifts, hands-on "body-feel" testing (judging a model by using it directly) beats benchmark optimization.

"When optimizing this version of the model, we basically abandoned those benchmarks. When you're facing a very large paradigm change, as long as the path is right, you can briefly — very briefly — ignore evaluation. Because through body-feel alone you can immediately detect a very large qualitative difference." 00:47:16

"Many models that perform very highly on those benchmarks — it doesn't mean their agentic ability is truly strong." 00:46:47

Larger Teams Are a Disadvantage When Debugging Training Instabilities

Counter to conventional wisdom that training frontier models requires massive coordinated teams, Luo Fuli argues small teams are actually better at the hardest parts.

"I don't believe that large teams have an advantage for discovering a possible problem, deeply investigating the root cause, and resolving it during model training. A large team may actually be a disadvantage." 01:55:07

"Training this model [1T parameter] — how big a team? Very small. Just for the training itself. Data also needs a few people. And you need a good infrastructure team." 01:54:38

Environment Matters More Than Experience When Hiring

In an industry that fetishizes credentials and pedigree, Luo Fuli hires for curiosity and puts people in environments that rapidly develop capability.

"These capabilities can all be picked up quickly. One to two months at most, three to four months in slower cases; they really can all be developed rapidly. So the environment is actually more important than experience." 03:24:20

"I care only about whether his initialization ceiling is high enough. I don't care much about the point that supervision and guidance have already brought him to." 03:25:17

Multi-Agent Systems Don't Actually Raise the Ceiling — They Only Improve Speed and Cost

While multi-agent is heavily hyped, Luo Fuli makes a sharp distinction between efficiency gains and capability gains.

"I haven't seen evidence that Multi-Agent can definitely ultimately achieve a higher upper limit. It can improve efficiency — the speed at which a task is completed — and it can definitely save costs. But I have not seen that Multi-Agent can definitely ultimately achieve a higher ceiling." 01:05:54

3. Companies Identified

OpenClaw (Open-source Agent Framework, now acquired by OpenAI)

Description: Open-source agentic coding and task completion framework built as an alternative to Claude Code. Why mentioned: Identified as a paradigm-defining framework that compensates for model weaknesses, enables community-driven evolution, and triggered a fundamental shift in how Luo Fuli and her team think about post-training. Its open-source nature is key to its impact.

"OpenClaw — because it's so open, you can try modifying it yourself... This kind of original manipulability gave me a very strong sense of impact." 00:17:34

"I still haven't seen an agent framework or product that is progressing faster than the OpenClaw open-source community. So I'd rather use the latest OpenClaw." 01:18:28

DeepSeek

Description: Chinese AI lab known for architectural innovations including mixture-of-experts (MoE) designs and multi-head latent attention (MLA). Why mentioned: Cited as a counterpoint to scaling-focused labs for pursuing architectural innovation to achieve efficiency under compute constraints. Credited with influencing inference chip design globally.

"DeepSeek cared more about seeing what problems the Llama-generation architecture had, rather than just rushing to scale. It focused more on: given Llama's current architecture, what problems arise when scaling, and what new structures could solve them." 02:52:00

Kimi (Moonshot AI)

Description: Chinese AI startup building frontier models. Why mentioned: Cited as one of the first Chinese companies to pivot toward the agent paradigm and achieve 1T+ parameter base models, and described as pursuing the "Anthropic path" rather than DAU-chasing.

"Someone from Kimi told me they feel they and Doubao are playing different games now... Kimi feels they're on the Anthropic path." 01:06:55

"MiniMax I think is relatively fast [in pivoting to agents]. Because they used a 10B model to achieve the current level of agent capability — I find that quite impressive." 03:05:36

Xiaomi / MiMo-V2 (MiMo Large Model Team)

Description: Xiaomi's internal large model team, led by Luo Fuli, which produced the MiMo-V2 Flash, Pro, Omni, and TTS models. Why mentioned: Demonstrated that a small team (~20-30 core contributors) with the right architecture bets (Hybrid Attention + MTP) and operating philosophy can punch well above its weight class in frontier model development.

"We basically in three to four weeks accomplished what might previously have taken thirty to forty weeks of research." 00:28:28

4. People Identified

Luo Fuli (罗福莉)

Description: AI researcher, formerly at Alibaba DAMO Academy and DeepSeek, currently head of Xiaomi's large model team. Led development of the MiMo-V2 series. This was her first long-form technical interview. Why mentioned: Demonstrates a rare combination of frontier technical depth (architectural choices, training instability debugging, post-training paradigm design) with organizational philosophy (flat teams, curiosity-driven hiring, no formal group divisions). Her early recognition of OpenClaw's significance and rapid pivot to agent-era post-training is an example of the kind of intellectual agility that distinguishes leading researchers.

"I feel like every two weeks, the things we've done make us find it hard to believe they happened in just those two weeks." 01:00:12

"I basically use my own experiments as the source. Even communicating with others — I do very little of that recently. So I don't know whether what I've said today in these hours will turn out to be wrong after a while." 03:23:23

5. Operating Insights

Mandatory Immersive Product Exposure as a Team Activation Tool

Luo Fuli's technique for forcing paradigm adoption is not a memo or presentation — it is forced immersive usage with social accountability through shared group chat.

"I bought a few machines, deployed OpenClaw on them, put everyone in different OpenClaw groups, and forced people to explore in different directions... Why have everyone discuss in the big group? Because individual imagination is truly limited. But when you see someone else using OpenClaw to accomplish something, it sparks your own imagination." 00:25:32

The "100 conversations or you're fired" threat was theater — the point was environment creation, not enforcement:

"I said: you just use it. I said I have my own assessment method. Actually my assessment method is: I won't assess. I just want everyone to start using it." 00:26:30

The 3:1:1 Compute Allocation Framework for Agent-Era Model Development

For teams building frontier models in the agent era, Luo Fuli proposes a concrete compute allocation heuristic that inverts conventional wisdom about pre-training primacy.
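
To make the 3:1:1 heuristic in this section's heading concrete, here is a minimal sketch that splits a GPU budget across research, pre-training, and post-training. The function name, the rounding rule, and the 1,000-GPU budget are illustrative assumptions, not figures from the interview.

```python
def allocate_gpus(total_gpus, ratio=(3, 1, 1)):
    """Split a GPU budget across research, pre-training, and post-training."""
    names = ("research", "pre_training", "post_training")
    weight_sum = sum(ratio)
    # Floor each share first so the plan never exceeds the budget.
    shares = {n: (total_gpus * w) // weight_sum for n, w in zip(names, ratio)}
    # Hand any leftover GPUs to the largest-weight buckets first (assumed rule).
    leftover = total_gpus - sum(shares.values())
    by_weight = sorted(zip(names, ratio), key=lambda pair: -pair[1])
    for name, _ in by_weight[:leftover]:
        shares[name] += 1
    return shares


plan = allocate_gpus(1000)  # hypothetical budget
print(plan)  # {'research': 600, 'pre_training': 200, 'post_training': 200}
```

With these toy numbers, research holds 600 GPUs against 400 for the two training stages combined, which matches the stated requirement that research exceed total formal training compute.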

"I think a very reasonable GPU ratio is perhaps 3:1:1 — for research, pre-training, and post-training. Pre-training and post-training should have roughly equivalent compute invested. And research should at minimum have even more GPUs than your total formal training GPUs — you need to reserve extra GPUs for research." 00:01:30

No Org Chart, No Groups, No Reporting Lines — Intentional Organizational Flatness for AI Research

The specific mechanism: eliminating group designations prevents capability siloing and allows pre-training researchers to naturally migrate to post-training work (where their diversity instincts are actually a competitive advantage).

"If you divide groups very clearly and rigidly, you are actually strangling part of people's creativity... Many people who do pre-training, their first concern should be diversity. If pre-training people go do post-training, they have a huge advantage — they naturally care about diversity." 01:58:56

"There's no direct reporting. You could say that. But Xiaomi itself has reporting lines — it's just that our team's organizational structure is completely flat. No direct reports." 01:59:55

Lean on the Frontier Model to Design Your Own Infrastructure, Then Hand Off to Smaller Models

A concrete research acceleration technique: use Claude Opus 4.6 (the most capable model) to design and modify complex agent architectures, then migrate the resulting framework to run on smaller, cheaper models.

"I let Claude Opus 4.6 help me redesign the entire new multi-agent system... Once Claude 4.6 helped me get the framework right, I switched to Sonnet, then to domestic models, even to the MiMo-V2 Pro we were training, and I found it was very powerful." 00:17:34

6. Overlooked Insights

MTP (Multi-Token Prediction) at Inference Time Is an Untapped Competitive Moat

Luo Fuli briefly mentions that MiMo-V2's use of MTP at inference time (not just training time) is nearly unique in the industry, and explains why: it exploits compute headroom that MLA-based architectures (DeepSeek, Kimi, Gemma) structurally cannot access.

This is a non-obvious architectural moat: MLA was designed to hit a near-perfect compute/memory balance on H-series chips, which means it cannot use MTP at inference without becoming compute-bound. MiMo-V2's hybrid sliding window architecture left more compute slack than the team had anticipated, which MTP then fills.

"We realized the remaining computation was really far too much — we hadn't anticipated there would be that much surplus... MTP is perfectly suited for this. We suddenly realized one day when designing the inference architecture." 01:32:33

"You'll notice that all MLA model architectures — whether Gemma or Kimi — I'd guess none of them have deployed MTP at inference, because once they do, they hit compute bounds again, making it very inefficient. So their models will all be somewhat slower." 01:29:41

The implication: as agent workloads push toward longer contexts and higher throughput, MTP-at-inference becomes increasingly valuable. Teams evaluating inference infrastructure or model providers should specifically test token generation speed at long contexts as a discriminating signal — not just benchmark scores.
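
The compute-slack argument can be illustrated with a toy model. The sketch below uses the standard expected-acceptance formula for draft-and-verify decoding, under the simplifying assumption that each drafted token is accepted independently with the same probability. The `compute_slack` parameter and all numbers are hypothetical; nothing here is measured from MiMo-V2 or any MLA model.

```python
def expected_tokens_per_step(accept_prob, num_draft):
    """Expected tokens emitted per verify pass with `num_draft` drafted tokens.

    Assumes each draft token is accepted independently with `accept_prob`,
    and that the verify pass always contributes one token of its own.
    """
    if accept_prob >= 1.0:
        return float(num_draft + 1)
    return (1 - accept_prob ** (num_draft + 1)) / (1 - accept_prob)


def relative_speedup(accept_prob, num_draft, compute_slack):
    """Toy speedup versus one-token-per-step decoding.

    `compute_slack` (hypothetical) is how many tokens one pass can verify
    before the step stops being memory-bound and latency starts to grow.
    """
    tokens = expected_tokens_per_step(accept_prob, num_draft)
    step_cost = max(1.0, (num_draft + 1) / compute_slack)
    return tokens / step_cost


# With slack for 4 tokens, drafting 3 tokens is nearly free.
print(relative_speedup(0.8, 3, compute_slack=4))
# With no slack, the same drafting makes each step 4x more expensive.
print(relative_speedup(0.8, 3, compute_slack=1))
```

In this toy model the first call gives a speedup above 1 and the second falls below 1, which mirrors the claim that an architecture already at its compute bound gains nothing from MTP at inference.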

Discrete Audio Tokenization as a Unification Strategy Is Being Quietly Validated

Buried in the TTS/Omni discussion is a significant research direction: Luo Fuli's team is attempting to tokenize audio (and eventually image) into discrete tokens unified with text, enabling a single training and RL infrastructure for all modalities. They have reportedly crossed the threshold on audio.

"We want to unify everything into two types of modal paradigms. So on audio modeling, we want to discretize it — make it the same as text, discrete token IDs... We've already crossed that threshold on audio." 02:09:37 and 02:12:05

The significance: if successful for images too, it eliminates the need for separate vision encoders and enables next-token prediction (and therefore RL training) over all modalities with a single infrastructure. Most labs (including Doubao, international labs) are using architectures that keep modalities separate. This would be a fundamental simplification with compounding advantages for post-training. The team is currently attempting to validate this for images.
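
As a minimal sketch of what "discrete token IDs, the same as text" could look like, the example below maps audio frames to the nearest entry in a small codebook and offsets the resulting IDs past the text vocabulary so both modalities share one sequence. The interview does not say how the team discretizes audio; nearest-codebook vector quantization, the codebook values, and the vocabulary size are all assumptions chosen for illustration.

```python
TEXT_VOCAB_SIZE = 8  # toy value; real text vocabularies are far larger

# Toy 2-D codebook standing in for a learned audio codebook (assumed).
CODEBOOK = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]


def quantize_frame(frame):
    """Return the index of the nearest codebook vector (squared distance)."""
    def dist(code):
        return sum((a - b) ** 2 for a, b in zip(frame, code))
    return min(range(len(CODEBOOK)), key=lambda i: dist(CODEBOOK[i]))


def audio_to_tokens(frames):
    """Map audio frames to token IDs placed after the text vocabulary."""
    return [TEXT_VOCAB_SIZE + quantize_frame(f) for f in frames]


text_tokens = [3, 5]                              # pretend text token IDs
frames = [(0.1, 0.2), (0.9, 0.1), (0.8, 0.95)]    # pretend audio features
sequence = text_tokens + audio_to_tokens(frames)
print(sequence)  # [3, 5, 8, 9, 11]
```

Once audio is just more token IDs in the same sequence, the same next-token objective and the same RL tooling apply to it unchanged, which is the simplification the passage describes.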