Gemini Robotics team (Google DeepMind)

1. Key Themes

From Digital AI to Physical Agents: The Core Strategic Shift

Google DeepMind is explicitly repositioning Gemini from a text/code/data AI into a physical-world agent. The framing is not incremental — it is a full architectural rethink around robots that perceive, plan, and actuate in unstructured environments.

"Typically AI has existed within the digital realm. It has been amazing for things like editing text, analyzing massive data sets, or writing code. But the physical world is a whole other beast." 00:00:00

"DeepMind has been working to bridge that gap by using Gemini models to move from static chatbots to physical agents. We aren't just teaching machines to see the world, but we're enabling them to perceive it with advanced spatial awareness and understanding." 00:00:51

Open Vocabulary Object Detection as a Platform Unlock

The shift from closed-vocabulary vision models (such as YOLO, trained on fixed datasets like COCO or ImageNet) to VLM-based open-vocabulary detection is framed as eliminating an enormous developer burden: no more building, labeling, and maintaining custom vision models per environment.

"With Gemini Robotics ER, we've moved to a vision language model architecture. The model uses semantic grounding to locate whatever you describe in natural language. Because language and vision are mapped in the same space, the model doesn't need a specific label for every object in your warehouse. This is called open vocabulary object detection. You can ask for the tool that looks like it's been used the most, or the component that is currently overheating." 00:03:04

"For developers, this is a massive win, as it eliminates the need to build, label, and maintain custom vision models for every unique environment." 00:03:47

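A minimal sketch of how a client might consume open-vocabulary detections. It assumes the model is prompted to reply with a JSON list of objects carrying `label` and `box_2d` fields, where `box_2d` is `[ymin, xmin, ymax, xmax]` normalized to a 0-1000 range. That reply format, the sample reply, and the helper name `to_pixel_boxes` are assumptions for illustration; the talk does not show the output schema.

```python
import json


def to_pixel_boxes(reply_text, image_width, image_height, scale=1000):
    """Convert normalized [ymin, xmin, ymax, xmax] boxes to pixel (left, top, right, bottom)."""
    detections = json.loads(reply_text)
    results = []
    for det in detections:
        ymin, xmin, ymax, xmax = det["box_2d"]
        results.append({
            "label": det["label"],
            "box_px": (
                round(xmin / scale * image_width),
                round(ymin / scale * image_height),
                round(xmax / scale * image_width),
                round(ymax / scale * image_height),
            ),
        })
    return results


# Hypothetical reply to a prompt such as "find the tool that looks most used".
sample_reply = '[{"label": "worn wrench", "box_2d": [250, 100, 750, 500]}]'
boxes = to_pixel_boxes(sample_reply, image_width=640, image_height=480)
```

The label here is free text chosen by the prompt, not an entry from a fixed class list, which is the practical difference from a closed-vocabulary detector.
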
Physical Common Sense and Embodied Reasoning as a New AI Category

The ER (Embodied Reasoning) model is positioned as a distinct category from general LLMs: it understands physics, weight, fragility, and structural relationships, not just pixels. This is presented as solving problems that have historically been very difficult to script manually.

"Gemini Robotics ER has what we call physical common sense. When it looks at a scene, it isn't just trying to identify the plate and the food, but it's reasoning about the relationship between them. It knows that to move a dish, it must first move the food into smaller containers that could be lifted or request human intervention. It also understands that a glass bottle is fragile, while a plastic one isn't." 00:04:32

"It's this kind of practical intuition that has historically been a nightmare to script manually." 00:04:32

Long Horizon Temporal Reasoning: Robots That Remember

Because Gemini Robotics ER 1.6 is built on the Gemini 3 Flash backbone and natively supports video/multi-image input, robots can now reason about sequences of events over time — enabling automated success detection that previously required complex hand-coded heuristics.

"Instead of asking, is the door open? You can pass a sequence of frames from the robot's journey through a facility. The model tokenizes these frames chronologically, it can reason about state changes. You can prompt the model with something like, look at the last 30 seconds of video. Did the gripper successfully secure the object or did it slip?" 00:05:59

"Now you can simply use the model as a temporal supervisor that watches the robots work and confirms that the physical state of the world has actually changed in the way that you intended." 00:06:47

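The temporal supervisor idea can be sketched as a rolling frame buffer plus a yes/no question to the model. This is an illustration under stated assumptions: `ask_model` is a hypothetical callable taking `(frames, prompt)` and returning text, `fake_model` stands in for a real model call, and the YES/NO answer convention is invented for this sketch.

```python
from collections import deque


class TemporalSupervisor:
    """Keeps a rolling window of recent frames and asks a model about state changes."""

    def __init__(self, ask_model, window_seconds=30, fps=1):
        # ask_model is a hypothetical callable: (frames, prompt) -> str
        self.ask_model = ask_model
        self.frames = deque(maxlen=window_seconds * fps)

    def add_frame(self, frame):
        # Oldest frames drop off automatically once the window is full.
        self.frames.append(frame)

    def check(self, question):
        prompt = (
            f"These {len(self.frames)} frames are in chronological order. "
            f"{question} Answer YES or NO first, then explain briefly."
        )
        answer = self.ask_model(list(self.frames), prompt)
        return answer.strip().upper().startswith("YES")


# Stand-in for a real model call, so the sketch runs without network access.
def fake_model(frames, prompt):
    return "YES, the object stays in the gripper." if frames else "NO frames."


supervisor = TemporalSupervisor(fake_model, window_seconds=30, fps=1)
for i in range(45):
    supervisor.add_frame(f"frame-{i}")
secured = supervisor.check("Did the gripper successfully secure the object?")
```

Keeping the window bounded means each query covers only the most recent 30 seconds, matching the prompt quoted above.
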
Vision Language Action (VLA) Models: Closing the Perception-to-Motion Loop

The flagship Gemini Robotics VLA model maps camera pixels and natural language directly to motor values in real time — going beyond planning into actual physical control. The basketball slam dunk example illustrates zero-shot generalization to unseen tasks.

"These models map camera pixels and natural language instructions directly to blocks of motor values. You give it a prompt like clean up the desk and the VLA streams camera frames to determine exactly how the actuators should move to get the job done." 00:11:15

"When we gave a robot a small basketball and a net game and asked it to do a slam dunk, the model was able to take the ball and place it through the hoop despite not being previously trained for this." 00:11:59

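A rough sketch of the perception-to-motion loop described above: observe a frame, request a block of motor values, execute it, and repeat. The talk does not describe the model's interface, so `policy`, `get_frame`, `send_command`, and the "empty block means done" convention are all hypothetical names and choices made for this illustration.

```python
def run_chunked_control(policy, get_frame, send_command, instruction, max_steps=100):
    """Closed loop: observe, get a block of motor values, execute it, repeat."""
    steps = 0
    while steps < max_steps:
        frame = get_frame()
        chunk = policy(frame, instruction)  # list of motor-value vectors
        if not chunk:  # an empty block is this sketch's "task done" signal
            break
        for motor_values in chunk:
            if steps >= max_steps:
                break
            send_command(motor_values)
            steps += 1
    return steps


# Dummy policy: returns two blocks of three commands, then signals completion.
calls = {"n": 0}


def dummy_policy(frame, instruction):
    calls["n"] += 1
    return [[0.1, 0.2], [0.1, 0.3], [0.0, 0.3]] if calls["n"] <= 2 else []


sent = []
executed = run_chunked_control(dummy_policy, lambda: "frame", sent.append, "clean up the desk")
```

Re-observing between blocks is what lets the loop react to a changing scene, in contrast to replaying a fixed trajectory.
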
Embodiment-Agnostic Design: A Universal Robot Brain

Rather than building for one hardware platform, Google DeepMind is explicitly targeting embodiment-agnostic models — humanoids, quadrupeds, mounted bi-arm setups — positioning Gemini Robotics as a general-purpose brain layer across the hardware ecosystem.

"Whether you are working with a humanoid, a quadruped, or a mounted bi-arm setup, our goal is to meet you where you are. These models are designed to be embodiment agnostic, providing a powerful, general purpose brain that can be adapted to whatever hardware you're working with." 00:12:48

Layered Safety Architecture Grounded in Real-World Injury Data

Safety is built using the "Swiss cheese model" with multiple stacked layers across semantic, physical, and operational dimensions. Critically, the Asimov Safety Benchmarks are grounded in actual hospital injury data from the NEISS database — not theoretical constructs.

"We've introduced the Asimov Safety Benchmarks. These aren't just theoretical. They are grounded in reality using the NEISS, or National Electronic Injury Surveillance System database. This contains real-world injury reports from hospitals to teach the model about physical common sense." 00:16:40

2. Contrarian Perspectives

Traditional Control Theory Is Not Dead — AI Is Not Always the Answer

Against the hype narrative that AI replaces everything, Paul Rees explicitly defends classical robotics control theory for high-repetition structured tasks. This is a rare admission of bounded applicability from someone building the AI alternative.

"In traditional robotics, if you have a machine in a factory repeating the exact same motion 10,000 times a day, standard control theory works just fine." 00:10:29

The Real Bottleneck in Robot AI Is Not Vision — It's Vague Human Language

Most robotics discourse focuses on perception or manipulation as the hard problem. Rees argues that ambiguous human prompting — not environmental complexity — is actually one of the hardest unsolved challenges.

"One of the hardest things for an AI-backed robot to navigate isn't necessarily a messy environment, but rather a vague human prompt. If we tell a robot, hey, put this away, it's suddenly trying to figure out what is this and where is away." 00:05:16

Safety Benchmarks Built on Theoretical Principles Are Insufficient

The conventional approach to robot safety relies on ISO standards and academic benchmarks. Google DeepMind's position is that these are not enough on their own: the Asimov benchmarks add grounding in real, documented human injuries on top of established ISO standards, which challenges benchmarks built purely on theoretical scenarios.

"These aren't just theoretical. They are grounded in reality using the NEISS, or National Electronic Injury Surveillance System database. This contains real-world injury reports from hospitals to teach the model about physical common sense. Second, we ground our work in established industrial ISO standards." 00:16:40

Agentic Vision — Letting the Model Fix Its Own Bad Inputs — Is More Important Than Better Sensors

Rather than solving input quality with better hardware, the ER model generates intermediate code to manipulate and correct its own visual inputs before reasoning. This inverts the assumption that sensor quality is the constraint.

"By enabling the code execution tool, the ER model can now generate code for intermediate steps, allowing the model to manipulate those images itself to get a better understanding of content... the Gemini Robotics ER model does one more step with its generated code to rotate the intermediate image into a more readable orientation." 00:08:20

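To make the "intermediate code" idea concrete, here is a dependency-free sketch of the kind of step such generated code might perform: crop a region of interest, then rotate it into a readable orientation. The talk does not show the generated code, and real code would likely use an image library; the pixel grid and the helper names `crop` and `rotate_clockwise` are assumptions for illustration.

```python
def crop(pixels, top, left, bottom, right):
    """Return the sub-grid covering rows top..bottom-1 and columns left..right-1."""
    return [row[left:right] for row in pixels[top:bottom]]


def rotate_clockwise(pixels):
    """Rotate a row-major pixel grid 90 degrees clockwise."""
    return [list(column) for column in zip(*pixels[::-1])]


# A tiny stand-in "image": each number represents one pixel.
image = [
    [0, 0, 0, 0],
    [0, 1, 2, 0],
    [0, 3, 4, 0],
    [0, 5, 6, 0],
]
region = crop(image, top=1, left=1, bottom=4, right=3)
upright = rotate_clockwise(region)
```

The rotated region would then be passed back to the model as a new input, so the text on the chip is read from a corrected view rather than the raw camera frame.
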
3. Companies Identified

Google DeepMind

The AI research division of Google responsible for the Gemini Robotics suite. Mentioned as the developer of Gemini Robotics ER 1.6, the VLA flagship model, the Asimov Safety Benchmarks, and the full robotics developer platform including the Python SDK, AI Studio, and Trusted Tester program.

"DeepMind has been working to bridge that gap by using Gemini models to move from static chatbots to physical agents." 00:00:51

MuJoCo

Physics simulation engine, now integrated into Google's AI Studio browser environment for prototyping robotic control without risk to physical hardware.

"We've actually created a few web app templates to show off what's possible, like integrating the MuJoCo simulation engine directly into the browser, where we're running a virtual robotic arm that uses the Gemini Robotics ER model to detect block locations, then performs a pick-and-place task." 00:15:05

YOLO (Ultralytics)

Mentioned as a representative closed-vocabulary computer vision model, contrasted with the open-vocabulary detection that Gemini Robotics ER offers for robotic perception tasks.

"If you've worked with computer vision before, you've probably used a model like YOLO that was trained on fixed data sets like COCO or ImageNet. Those are fantastic for what they are, but they are closed vocabulary, meaning they can only identify a predefined list of objects." 00:02:24

ESMT

Semiconductor manufacturer whose chip is used as a concrete example of the agentic vision / text-reading capability of Gemini Robotics ER in a production line context.

"Say you have a production line that makes circuit boards, and one of the steps involves taking the unique ID off of a certain chip to record it, such as the ESMT chip in this image." 00:07:31

4. People Identified

Paul Rees

Developer Relations Lead for Robotics at Google DeepMind. Self-described maker with a background spanning gardening, woodworking, IoT, and complex machines. Serves as the primary technical evangelist and developer-facing lead for the Gemini Robotics platform, presenting the full capability stack at Google I/O 2026.

"I'm Paul Rees and I'm the Developer Relations Lead for Robotics at Google DeepMind. I would generally describe myself as a maker, whether it's something low-tech like gardening and woodworking, or more advanced projects involving the Internet of Things and complex machines." 00:00:00

5. Operating Insights

Use the Model as a Temporal Supervisor, Not Just a Perception Engine

Operations teams deploying robots should replace brittle heuristic success-detection scripts with the ER model as a continuous video supervisor. Instead of writing complex code to verify task completion, pass rolling video windows and query the model directly about state changes.

"In the past, you'd have to write complex heuristic code to check if a task was completed. Now you can simply use the model as a temporal supervisor that watches the robots work and confirms that the physical state of the world has actually changed in the way that you intended." 00:06:47

Prototype in Simulation Before Touching Hardware — Fail Fast, Fail Safe

AI Studio with the MuJoCo integration allows teams to test perception logic and orchestration against a browser-based virtual robot arm before any physical deployment, reducing iteration cost and the risk of hardware damage.

"This lets you rapidly prototype prompts and test how the model perceives images from your specific hardware without having to constantly re-flash or reload scripts on the robot... really lending itself to a fail-fast, fail-safe strategy." 00:15:05

Stack Safety in Layers — Never Rely on a Single Guardrail

For any operator deploying autonomous physical agents, the key design principle from Google DeepMind's own safety architecture is that no single safety layer is sufficient. Semantic, physical, and operational safeguards must all be present simultaneously.

"We like to think of it as the Swiss cheese model of defense. No single layer is a perfect barrier, but by stacking multiple layers of safeguards, spanning the semantic, physical, and operational aspects of safety, we can effectively mitigate risk." 00:16:40

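The layered principle can be sketched as a set of independent checks that must all pass before an action runs. The three layer names follow the quote; the specific rules, the 50 N force limit, and the action fields (`flagged_harmful`, `force_n`, `human_in_workcell`) are hypothetical examples, not details from the talk.

```python
def semantic_layer(action):
    # Hypothetical rule: refuse instructions flagged as harmful upstream.
    return not action.get("flagged_harmful", False)


def physical_layer(action):
    # Hypothetical limit on commanded force, in newtons.
    return action.get("force_n", 0.0) <= 50.0


def operational_layer(action):
    # Hypothetical rule: no motion while a person is inside the workcell.
    return not action.get("human_in_workcell", False)


LAYERS = [
    ("semantic", semantic_layer),
    ("physical", physical_layer),
    ("operational", operational_layer),
]


def evaluate(action):
    """Return (allowed, names of the layers that blocked the action)."""
    blocked = [name for name, check in LAYERS if not check(action)]
    return (not blocked, blocked)


ok = evaluate({"force_n": 10.0})
too_strong = evaluate({"force_n": 80.0, "human_in_workcell": True})
```

Every layer is evaluated rather than stopping at the first failure, so a log of the result shows each safeguard that caught the action, which helps reveal when one layer is routinely compensating for a gap in another.
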
6. Overlooked Insights

The Gemini Robotics VLA Is in Trusted Tester Only — This Is a Meaningful Moat Signal

The flagship Vision Language Action model, the one that converts camera pixels to motor commands in real time and demonstrated zero-shot generalization in the slam dunk example, is not publicly available. It is gated behind a Trusted Tester program. This was mentioned only in passing, but it signals that Google is deliberately controlling who gets access to the most powerful layer of the robotics stack. For investors and competitors, this matters: the perception and planning tooling (ER 1.6) is open to developers, but the actuation intelligence that makes fully autonomous robots possible is being rationed. Teams admitted to the program early could gain a compounding hardware and software integration advantage that would be hard to replicate.

"These models map camera pixels and natural language instructions directly to blocks of motor values... the flagship Gemini Robotics model which is currently available through our Trusted Tester program." 00:11:15

Real Hospital Injury Data Is Now a Training Signal for Robot Safety — This Has Regulatory Implications

The Asimov Safety Benchmarks use the NEISS database — a federally maintained surveillance system of actual emergency room injury reports — as training and evaluation data for robot physical common sense. This was mentioned in a single breath, but it is significant: it means Google DeepMind is operationalizing real human harm data as a feedback loop into model behavior. This is a methodological precedent that could become the de facto standard for regulatory approval of autonomous robots in commercial and consumer environments, giving Google an early mover advantage in shaping what "certified safe" means for the industry.

"These aren't just theoretical. They are grounded in reality using the NEISS, or National Electronic Injury Surveillance System database. This contains real-world injury reports from hospitals to teach the model about physical common sense." 00:16:40