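162: The Story of China's Earliest AI Entrepreneurs: Talking with Tang Wenbin About Genius Strategy, Megvii, Robots a Decade Ago, and His New Embodied AI Startup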
- 01 China's First-Generation AI Builders Are Now Leading Embodied Intelligence
- 02 The Real Bottleneck in Embodied AI Is Intelligence, Not Hardware
- 03 The "Grafting vs. Native" Training Distinction Is the Key Technical Debate
Participants: Manchi (host), Tang Wenbin (Co-founder & CEO of Yuanli Lingji / RoboticX)
Podcast: 晚点聊 LateTalk, Episode 162
1. Key Themes
China's First-Generation AI Builders Are Now Leading Embodied Intelligence
Tang Wenbin co-founded Megvii (旷视), one of China's earliest AI companies, in 2011 and built it on computer vision. In 2025 he co-founded Yuanli Lingji (原力灵机 / RoboticX), which focuses on embodied intelligence. The arc from CV to logistics robotics to embodied AI reads as one coherent, compounding technical journey rather than a series of pivots.
"We wanted to let robots first see the world... Megvii's name at the time was MagV, MegaVision—let machines first understand the world. That was our thinking." [00:34:38]
The Real Bottleneck in Embodied AI Is Intelligence, Not Hardware
Tang consistently argues that actuators, cameras, and mechanical arms are not the binding constraint. The limiting factor is model intelligence—can the robot close the loop in a scene without failure?
"The core bottleneck today is still about the model and the brain... It's not a mechanical arm problem, it's not the camera—those are mostly okay. The problem is it's not smart enough to complete the task. So the bottleneck today is stuck at intelligence." [01:00:10]
The "Grafting vs. Native" Training Distinction Is the Key Technical Debate
Tang introduces a crucial distinction: most companies take a pre-trained VLM and bolt on an action module (grafting). Yuanli Lingji argues for native embodied training—integrating robot data from Day 1 of VLM pretraining.
"It's a bit like a student who just finished nine years of compulsory education, then you drag them to a sports school and make them train intensively. Two problems arise: first, because they didn't train from childhood, their athletic foundation has a ceiling. Second, in those three years of intensive sports training, you may have caused their language and math scores to collapse." [01:01:09]
2. Contrarian Perspectives
Automotive Factories Are NOT Good Scenes for Embodied AI — Despite Everyone Saying Otherwise
While Tesla, Figure, and others tout automotive factories as the ideal embodied AI deployment, Tang has direct experience and disagrees sharply.
"I think automotive factories are not good scenes. It's highly error-intolerant and highly rhythm-constrained. These two things are both very hard problems... Although everyone thinks the auto factory might be the best scene, I don't think it is." [01:54:27]
He adds that reality never matches imagination: "You'll find the packaging format, the product shape, the auxiliary materials, the position constraints, the container constraints—everything ends up completely different from what you imagined." [01:54:56]
The Spring Gala (春晚) Robot Demos Were Strategically Wasteful for Non-Performance Companies
Several embodied AI companies spent tens of millions to hundreds of millions of RMB on Spring Festival Gala performances. Tang frames this as misaligned with building real utility.
"Did every company that appeared on the Spring Gala achieve its intended effect?... We would never do it. Because what we're building is not a performance-type route." [01:53:27]
Virtually No Embodied AI Robots Are Actually Being Used Continuously Today
Despite the enormous hype, Tang delivers a cold assessment of real deployment. He defines "real use" as: 10+ hours powered on per day, continuously for 2 months, in a real production scene, at 100+ units. His conclusion:
"I think almost none [meet this standard]. So I believe today, the number of scenes that have achieved this kind of deployment is essentially zero." [01:22:15]
Simulation Data Is Largely Useless for Contact-Rich Manipulation
Against the prevailing enthusiasm for synthetic/simulation data (endorsed by figures like Li Fei-Fei, Su Hao, and Jensen Huang), Tang argues the sim-to-real gap is too large for fine motor tasks.
"For contact-rich manipulation tasks, simulation data is not very helpful right now. It's very hard to simulate accurately... the sim-to-real gap is large, so the contribution to the algorithm is limited." [01:08:03]
Talent Density Alone Doesn't Create Commercial Success—"This Is Not Essential" Culture Is a Bug, Not a Feature
Elite teams dismiss important but unglamorous work as "not fundamental." Tang identifies this as a critical failure mode.
"In practice, when you truly become a commercialized product, everything that impacts the customer is fundamental. It's not just the hardest final problem that is fundamental." [00:54:44]
3. Companies Identified
Megvii (旷视科技) Description: One of China's earliest AI/computer vision companies, founded 2011. Why mentioned: Tang co-founded it; used as deep case study for lessons in B2B AI, scaling, over-promising, focus, and what not to repeat.
"Megvii originally had very high talent density, but we were still diluted—because we did too many things." [01:55:53]
Alipay (支付宝) Description: Digital payments platform under Ant Group. Why mentioned: First large-scale customer for Megvii's facial recognition; drove the jump from ~90% to 98-99% accuracy on the LFW benchmark.
"Alipay surveyed every company providing facial recognition technology on the market. In the end, they found our results were the best." [00:24:18]
Uniqlo (优衣库 / Fast Retailing) Description: Global apparel retailer. Why mentioned: Megvii won a massive, highly complex logistics automation tender after nearly failing—then rebuilt all code from scratch to deliver. Tang eventually met the founder Tadashi Yanai.
"They felt this kind of tenacious spirit was really admirable. So afterward they handed us some of their other projects." [00:42:57]
Kimi (月之暗面) Description: Chinese LLM startup. Why mentioned: Zhou Xingyu (周星宇) of Kimi mentioned as a competitive programmer from the same community.
"Zhou Xingyu of Kimi also used to do competitive programming." [00:08:15]
Databricks / Hugging Face / MongoDB / Red Hat Description: Various developer infrastructure companies. Why mentioned: Tang cites them as examples of startups that successfully built open-source infrastructure with thriving ecosystems.
"Hugging Face counts, right? Android too... MongoDB, message queues—these all grew within the open-source ecosystem." [01:38:19]
Yuanli Lingji / RoboticX (原力灵机) Description: Tang's new embodied intelligence company, founded early 2025. ~100 people, 40% ex-Megvii, 60% new hires. Why mentioned: Central subject of the episode. Notable for: not building humanoids, participating in VLM pretraining natively with robot data, and building open infrastructure (DexBotics, RoboChallenge).
"Intelligent, useful, trustworthy robots—that's our mission statement." [01:26:00]
Jiyue / 极越 (千里) Description: Smart EV company affiliated with Yin Qi's network. Why mentioned: Has autonomous driving data; collaborating with Yuanli Lingji on joint VLM pretraining that includes driving data, robot data, and internet video data.
"We co-trained a native VLM multimodal model—we contributed robot data, Jiyue contributed some multimodal data, plus autonomous driving data." [01:05:35]
4. People Identified
Tang Wenbin (唐文斌) Description: Co-founder & CEO, Yuanli Lingji. Previously co-founder of Megvii. Served for seven years as a national coach for China's NOI/IOI competitive programming team. Why mentioned: Subject of the episode. Combines elite technical depth, 15 years of B2B AI commercialization experience, and a current focus on embodied AI.
"I feel very grateful for this era... we've witnessed wave after wave of technological change, and many things once thought impossible have become possible." [02:04:09]
Yin Qi (印奇) Description: Co-founder of Megvii; now Chairman of Jiyue and Qianli. Why mentioned: Tang's co-founder and long-time collaborator. Described as having exceptional organizational talent, creativity, and strategic vision.
"He directed a skit for 27 classmates—all male. I was deeply impressed by his organizational ability." [00:12:09]
Fang Haoqiang (方好强 / 方浩强) Description: Joined Megvii as a high school student; led the breakthrough in deep learning for face recognition. Why mentioned: Represents the "young genius" strategy. Entered China's IOI national team in 9th grade, won silver at IOI as a 10th grader, then joined Megvii. Now a core figure at Yuanli Lingji.
"The first person in our entire company to do deep learning was Fang Haoqiang. We sent a high school student to probe it. In the end, the results were surprisingly good." [00:24:48]
Zhao Erjing (赵尔静) Description: Joined Megvii as a university intern; paired with Fang Haoqiang as the "power duo" that built Megvii's breakthrough face recognition pipeline. Why mentioned: Tang knew him since middle school (same hometown, same high school). Now at Yuanli Lingji.
"Zhao Erjing and Fang Haoqiang—we called them internally the 'power combination.' They did the whole pipeline from keypoint detection to recognition, all extremely well." [00:25:00]
Luo Tianchen (罗天成) Description: Elite competitive programmer; Zhejiang provincial team member. Why mentioned: Tang's fellow competitor from early days; now prominent in AI circles. Also a Codeforces enthusiast; Codeforces problems and ratings are now used to benchmark AI coding models.
"The first time I participated in the NOI national competition, it was alongside Luo Tianchen. We were both on the Zhejiang provincial team that year." [00:07:46]
Tang Jie (唐杰) Description: Professor at Tsinghua; Tang Wenbin's Master's supervisor. Now a prominent AI figure. Why mentioned: Tang Wenbin was his first-ever graduate student.
"Tang Jie had just been promoted to associate professor and had just gained the right to admit students. He could only admit one person per year. That first person was me." [00:13:36]
Wang Yu (王宇) & Wu Wenxiong (吴文雄) Description: Tsinghua professors/researchers behind the RL framework "rlyml." Why mentioned: Their reinforcement learning framework complements DexBotics (which focuses on imitation learning). They are now discussing merging the two into a unified open-source project.
"We found that rlyml had already done this reinforcement learning framework quite well... so we discussed it with Wang Yu and the team, and now we're looking at merging them into a larger project." [01:35:55]
5. Operating Insights
Design for "Error Tolerance" Before Deploying Robots at Scale
Tang lays out a four-part checklist for scenes that suit today's embodied AI: the scene should be error-tolerant, time-tolerant (no strict cycle time), generalized (not hyper-specific), and long-running. Logistics satisfies these criteria because the dispatch system can route failures to human workers.
"Today I really can't achieve 100%. So you have to allow me to make mistakes, or you have to have a way to catch those mistakes when they happen. That's the first criterion." [01:17:49]
Open-Source Infrastructure Early—Timing Is Everything
Megvii built a deep learning framework (MegEngine) in 2013, before TensorFlow or PyTorch existed, but did not open-source it until 2018, by which point it was irrelevant. Tang applies this lesson directly to DexBotics (open-sourced immediately in 2025).
"By the time we wanted to open-source it, it was already 2018. At that point, there was no longer any meaning to it. The lesson: open-sourcing needs to happen early." [01:33:32]
Avoid the "Project Company" Trap in B2B AI
Tang distinguishes between product companies and project companies. Every customization engagement erodes scalability. The key is to build modular, configurable (not custom) solutions and to make sure the ROI is genuinely calculable rather than a slide-deck fiction.
"Once every project requires custom development for the client, the scalability of your business has very serious problems." [00:32:11]
6. Overlooked Insights
Competitive Programming Training as a Talent Filter + Network Moat
Tang briefly mentions that his role as the national NOI/IOI head coach from 2007–2013 gave him early visibility into China's most exceptional young technical minds. This allowed Megvii to recruit Fang Haoqiang and Zhao Erjing as teenagers—before any other company could identify them. This was not a recruitment strategy in any conventional sense; it was a byproduct of volunteer academic service. The result: a decade-long pipeline of exceptional, deeply loyal talent.
"He asked me: I've got nothing to do now, what should I do? I said, we just started a company—why don't you come work here? So he joined in 10th grade." [00:26:18]
This suggests a non-obvious talent acquisition strategy: investing in pre-professional talent pipelines (olympiad coaching, academic competitions, university clubs) can generate compounding loyalty and capability advantages that pure recruiting cannot replicate. No other company in the embodied AI space appears to have this specific structural advantage.
The "Chicken-and-Egg" Data Flywheel Problem Is the Real Existential Risk—Not Model Performance
Tang briefly but powerfully names what he calls the "chicken-and-egg" problem: robots aren't mature enough to be deployed at scale → therefore they don't generate real-world deployment data → therefore the data flywheel never starts → therefore the competitive moat never forms. This isn't just a technical problem—it's a go-to-market and system design problem that most companies are not solving.
"Robots aren't mature, making them unusable at scale. And if they can't be used, there's no data from real failure cases and takeovers. This chicken-and-egg problem exists. We must find a way to let robots be deployed at scale in batches—only then does the resulting data become the most useful data." [01:14:22]
The implication: the company that solves the deployment infrastructure problem (tolerance systems, dispatch, human fallback) before the model is "ready" will be the one to capture the data flywheel first—and that advantage will compound in ways that pure model-focused labs cannot replicate.
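One way to picture the deployment infrastructure described here is a dispatcher that lets the robot try first, hands failures to a human, and keeps every failure as training signal. This is a hypothetical sketch, not Yuanli Lingji's system; `Dispatcher`, `robot_attempt`, and the task names are invented for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Dispatcher:
    # Hypothetical robot policy: returns True if the robot completed the task.
    robot_attempt: Callable[[str], bool]
    # Tasks waiting for a human worker to take over.
    human_queue: List[str] = field(default_factory=list)
    # Failure cases retained as data for later model improvement.
    failure_log: List[str] = field(default_factory=list)

    def handle(self, task: str) -> str:
        """Try the robot first; on failure, log the case and fall back to a human."""
        if self.robot_attempt(task):
            return "robot"
        self.failure_log.append(task)
        self.human_queue.append(task)
        return "human"


# Usage: a stand-in policy that fails on one hard case.
dispatcher = Dispatcher(robot_attempt=lambda task: task != "crushed_box")
results = [dispatcher.handle(t) for t in ["standard_tote", "crushed_box"]]
```

The design point is that the work still gets done when the model fails, so robots can run in production before they are fully reliable, and each takeover adds to the failure data that the flywheel depends on.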