Jim Fan
Linxi "Jim" Fan is a Director of AI and Distinguished Scientist at NVIDIA, where he co-leads Project GR00T (humanoid robotics) and the GEAR (Generalist Embodied Agent Research) Lab. He holds a Stanford Ph.D. and was OpenAI's first research intern. Fan is best known for his work on embodied AI and physical robotics, including DreamDojo, an open-source interactive world model trained on 44,000 hours of human video that enables robots to learn from synthesized simulation without hand-authored physics engines.
“We reproduce this baseline using the GR00T N1.7 implementation and initialize from the pretrained nvidia/GR00T-N1.7-3B checkpoint (Appendix E). NVIDIA's GR00T N1.7 achieves 35% average success as EgoScale — the strongest baseline, but 30 points behind T-Rex.”
Source→“Jim Fan and Yuke Zhu's research taste and style matched mine quite well. At the time I also really wanted to collaborate with them.”
Source→“Dream Dojo is a relatively universal world model pretrain. We open-source it so that anyone with a new robot can quickly connect to our world model, fine-tune it, and use it.”
Source→“We pre-train EgoScale on 21k hours of in-the-wild egocentric human data with zero robot data whatsoever and during pre-training we predict these hand joints and wrist poses. Then in action fine-tuning we collect only 50 hours of high-precision mocap data and four hours of tele-op. That's four hours of tele-op, less than 0.1% of our training mix.”
Source→“UMI is perhaps one of the greatest papers ever written in robotics data and it spawned two unicorn startups. On the left-hand side is Generalist improving this design so you can wear the gripper here and on the right-hand side, Sunday made these three-finger data gloves.”
Source→“Anyone driving Tesla or Waymo here? Anyone? You know, when you're driving, you're actually contributing to the biggest physical data flywheel.”
Source→“No one can describe what happened next better than Ilya himself. If you believe in deep learning, deep learning will believe in you.”
Source→“Can you spot another? That's Andre right there. So Andre, we're going to the Computer History Museum.”
Source→“These are some clips from Veo 3. You can see that the models, they pick up gravity, buoyancy, lighting, reflection, refraction, all by themselves. None of this is coded in.”
Source→“The last three years were dominated by VLAs or vision language action models, and models like Pi and GR00T fall in this category... We call this new type of model World Action Models, or WAM. So let's all take a moment of silence for our dear friend VLAs. They've served us well, rest in peace, long live World Action Models.”
Source→AI-extracted from podcast / newsletter / paper summaries. May contain errors.