Manqi | 晚点聊 LateTalk (Investigative Journalism) Summary

1. Key Themes

Transitioning from Research to Entrepreneurship: The Importance of Organizational Capability

Cao Yue's journey from top-tier researcher at Microsoft Research Asia (where he won an ICCV Best Paper award) to entrepreneur rests on one critical insight: breakthrough innovations require not just technical excellence but also strong organizational capabilities. He came to this realization while studying OpenAI's work in 2021.

"When I saw OpenAI's DALL-E and CLIP in early 2021, I was very surprised... The real question was why they could make such work and what their methods, mindset, and organizational structure were." [00:14:00]

His decision to join Zhiyuan Research Institute and later start his own company stemmed from understanding that:

"It's about how you design a scalable system that can maximize the use of computing power... When you really need to build a scalable system, you need very diverse people - some to crawl data, some to clean data, some to train models, some to optimize efficiency." [00:17:30]

The Evolution of AI Video: From Single-Shot Generation to Narrative Capability

Cao Yue identifies a critical shift in AI video technology - from generating individual clips to creating coherent narratives. He sees Sora 2's most impressive feature as its ability to generate 10-second videos with basic storytelling through multiple camera angles, not just its audio-visual capabilities.

"The third point I think is most critical - within about 10 seconds, it can have basic storytelling... It's not just that it can switch shots, but making those cuts feel narrative-driven, making it something ordinary people would want to consume." [00:53:00]

However, he notes this requires fundamental organizational capability:

"This is about product needs being able to feedback to the model team... They need to define what narrative short films mean, establish a benchmark for what capabilities to achieve, then through model-side data training and optimization, actually give the model this capability." [01:02:00]

China's Innovation Gap: Moving from Efficiency to Original Innovation

When asked why China hasn't produced an organization like OpenAI, Cao Yue shares Wang Huiwen's insight that fundamentally changed his perspective:

"When I asked Wang this question, he quickly gave me an answer - because domestic internet companies' development stage means we're not rich enough... When you're in a catch-up phase, there's always a target ahead. You just need to catch up faster with efficiency innovation." [00:28:00]

This shapes his view that China is transitioning from a phase where efficiency and business model innovation were sufficient, to one requiring original innovation:

"As you get closer to the frontier in various industries, the direction becomes less clear... When there's no one significantly better ahead of you, your Mindset needs to change at every level - from investors to entrepreneurs to society's tolerance for failure." [00:29:00]

2. Contrarian Perspectives

Product-Led vs. Model-Led: Rejecting the "Model Drives Product" Dogma

Cao Yue challenges the dominant "model drives product" philosophy that was prevalent in early AI startups:

"In the early stage, people might say product operations shouldn't be too fancy... but I think this has some historical context. When models were in basic iteration stage, excessive product intervention might not be good for the model. But at a certain stage, whether your product is user-friendly has many areas to iterate on." [01:33:00]

His contrarian take: truly successful AI products require deep vertical integration where product insights drive model development priorities, not just showcase model capabilities:

"It's like when product says 'we need this feature now as highest priority,' the model team should immediately align on the importance of this feature. This is fundamentally about organizational change." [01:34:00]

Character ID Preservation Means More Than Facial Likeness

While most companies focused on preserving facial features in AI video, Cao Yue identified that the real challenge is maintaining performance and voice:

"When a model can generate both visuals and audio, preserving character ID means you need to preserve not just their appearance but their voice... Before this, models only generated visual frames, so preserving character ID only meant preserving appearance without voice." [00:54:00]

His insight: body proportions and physical characteristics matter more than perfect facial recreation when people know the subject:

"When I used Sora's feature for people I know, sometimes the facial features are quite similar, but the height and body type... especially when there are two people I know together, their height and body proportions don't match reality. This is quite problematic." [01:35:00]

LLMs as Context Alignment Tools, Not Just Answer Engines

Cao Yue presents a rarely discussed use case for large language models - reducing communication friction between people with different backgrounds:

"The biggest problem between people is aligning context... When I express a viewpoint, I might have vast context and information, but compress it into a few dozen words. The receiver only gets those words without all the context... Language models are exceptionally strong at helping you bridge this gap." [01:46:00]

His team practices this systematically:

"When product colleagues send a message with several concepts requiring background knowledge that you don't have, the best approach is to screenshot it and send it to a language model, asking 'please explain what this person means.' I think many people in our organization use this method to compensate for communication gaps." [01:47:00]

3. Companies Identified

OpenAI - Organizational Excellence Through Vertical Integration

Cao Yue extensively studied OpenAI's organizational capabilities, particularly their approach to Sora 2:

"From a relatively technical perspective, I think generating a short film with narrative capability should be feasible under current technical conditions... The key is that when they had product needs, this goal could feedback to the model team, and model team says we can establish benchmarks to ultimately give the model this capability through optimization." [01:02:00]

What makes them special: End-to-end capability to translate product vision into model capabilities.

Zhiyuan Research Institute - China's Early Foundation Model Pioneer

Described as one of the earliest organizations in China to embrace OpenAI's methodology and focus on foundation models:

"Zhiyuan was one of the earliest organizations in China to embrace OpenAI's methodology and large models... It's also a new type of research institution, so essentially you could consider it as not having publications as core metrics... They very early proposed building large computing clusters - around early 2022, Zhiyuan had a 1,500 A100 GPU cluster." [00:20:00]

Why it mattered: Provided freedom from traditional academic constraints while maintaining research focus.

DeepSeek - Setting High Standards for Talent Acquisition

Used as a reference point for hiring philosophy in AI video startups:

"When we were hiring algorithm colleagues, our methodology was to recruit those graduating or within 3-5 years of graduation - people who are very hungry, still on the frontlines, with excellent capabilities. We didn't focus much on whether they previously did NLP or computer vision, because if their fundamentals and state are at a peak, learning new domains might be very quick, maybe just a few months." [00:33:00]

Demonstrates: Prioritizing learning velocity and fundamentals over domain-specific experience.

Pixar - Business Model Blueprint for AI Video

Wang Huiwen's insight about Pixar shaped Cao Yue's strategic thinking:

"If you want to do this direction, you can study Pixar. Pixar's business model is very good... First, they create movie pieces through graphics technology to generate box office. But after the movie is released, the character IP is maintained by Pixar, so you can sustainably monetize through IP peripherals. One aspect has film industry characteristics, another has IP industry characteristics, but it originated from having new technology." [01:52:00]

Key difference from live-action: Character IP remains with the studio rather than transferring to actors.

4. Operating Insights

Context Alignment as Core Infrastructure

Cao Yue's team systematically uses LLMs to bridge communication gaps between technical and product teams:

"In our organization, we tell everyone that models are very strong in this aspect... When product colleagues explain something with concepts you don't understand, screenshot it and send to a language model asking for explanation. This significantly reduces friction between people and helps them align faster." [01:47:00]

Implementation: Screenshot → LLM explanation → Faster alignment between algorithm and product teams.

Rapid Organizational Pivoting: From Model-Centric to Vertically Integrated

After releasing March One in April 2025, Shensi AI made a fundamental organizational shift:

"The core change is that we've now organizationally built up from the model system to the product R&D system to operations system and connected them. Most importantly, having more frequent communication - when you're from different backgrounds but need to work closely together, you can let these people sit closer physically, they'll naturally have more communication." [01:29:00]

Result: Product requirements now directly influence model training priorities with tight feedback loops.

The "Professional CEO" Framework from Wang Xing

Cao Yue applies Li Xiang's framework for systematic thinking:

"First, you need to think clearly about what industry your company is in, what are the trends in this industry, what's the development trend and pace... In AI video, what development stage are we at? At this stage, what are the core pain points? What solutions do you use to solve these pain points?" [01:40:00]

This framework helps move from intuition-based to methodology-based decision making.

Early Model Testing Through Internal Competitions

Before external release, Shensi AI validated product-market fit internally:

"Before Sora 2 came out, we were already discussing this - because it's really so natural. We even held competitions internally, giving awards in this space, asking everyone to compete. It has different award categories like 'most cinematic' and 'most hilarious' - its threshold is really very low." [01:18:00]

Validation: If it works as entertainment for technically sophisticated internal users, it has consumer (2C) potential.

5. Overlooked Insights

The "Ambition-Driven" Career Pattern Recognition

While Cao Yue discusses his career decisions extensively, he reveals a meta-insight about his own decision-making that's easy to miss:

"I discovered that I'm quite an ambitious person. This should be a very fundamental driver for me... If you want to achieve great results, your direction of effort should be to make yourself worthy of those results... One side is that when you can build deep understanding of things and fully train your capabilities, you then have the ability to create things of huge value for the world." [00:42:00]

Why this matters: In retrospect, he recognized that every major decision - from joining MSRA to Zhiyuan to entrepreneurship - was driven by this underlying ambition, not by conscious career planning. This suggests that successful technical founders may need similar introspection to understand their true motivations.

The Unspoken Timing Insight About 2C AI Video Products

Buried in the discussion about Sora 2's potential as a consumer product is a subtle but critical insight about market timing:

"Whether it can become a true consumer platform product, I think no one has the answer right now... The most critical thing is retention - what group of people's relatively long-term rigid needs does it actually satisfy?... But whether it can become a major consumer platform, I think this should be something no one can answer right now." [01:22:00]

What's overlooked: Even as Cao Yue plans a 2C product and acknowledges that the technical capabilities are there, he is notably cautious about declaring this "the moment" for consumer AI video platforms. This suggests that sophisticated founders are hedging their 2C bets even as they pursue them: they see possibility but not certainty. The window may be opening, but it is not definitively open yet.

Timestamps are approximate and based on content flow. For a complete timeline with precise timestamps, refer to the full transcript; markers are given in brackets as [HH:MM:SS].