Do as I Do: Dexterous… | arXiv Physical AI Research Summary

Berkeley AI Research | arXiv:2606.19333 | June 2026

1. Key Themes

The Data Bottleneck for Dexterous Manipulation Is Solvable Without Teleoperation

The central bet of this paper is that the trillion-hour library of human video already online is a viable substitute for expensive teleoperation data — if you can bridge the embodiment gap. The authors build an end-to-end pipeline that ingests raw monocular RGB video (YouTube clips, egocentric footage, AI-generated video) and outputs physically executable trajectories for a 22-DoF dexterous robot hand. They demonstrate this across 20 distinct manipulation verbs (hammering, whisking, pouring, spreading, etc.) covering a range of grasp types.

"In total, our pipeline produced 500 high-quality, human-verified dexterous manipulation trajectories across internet (53%), egocentric (31%), and generated (16%) videos." — Section 4.4

The authors claim this is the first pipeline to complete the full loop from internet video to real dexterous-hand rollouts, verified on hardware with bimanual UR3e arms and Sharpa Wave hands.

Reconstruction Quality Is the Critical Bottleneck — And It's Now Materially Better

Prior object tracking systems (FoundationPose, Any6D) fail under real-world conditions: motion blur, occlusion, and variable lighting cause a loss of pose lock that cascades through the entire pipeline. The paper's core technical contribution is repurposing SAM 3D, a generative 3D foundation model, as a temporally coherent video tracker using guided diffusion sampling. The result is a measurable improvement in reconstruction quality, in both human preference ratings and benchmark metrics.

"Human raters prefer our object tracking over the state-of-the-art FPose 67% of the time, with most videos receiving unanimous preferences... 75% of videos received unanimous agreement across all three raters, and inter-rater agreement was substantial (Fleiss' κ = 0.65)." — Section 4.2 and Appendix C

On standardized benchmarks (DexYCB and HOI4D), the authors report new state-of-the-art results on the F-5, F-10, and Chamfer distance metrics against all prior joint reconstruction and object tracking baselines. Practically: your pipeline doesn't fall apart when a human hand partially covers the object mid-motion.
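
For readers less familiar with the metrics cited in this subsection, the following is a minimal Python sketch of Fleiss' κ (rater agreement), a symmetric Chamfer distance, and a thresholded F-score. The function names and toy data are illustrative and do not come from the paper. The sketch assumes that F-5 and F-10 denote F-scores at 5 mm and 10 mm distance thresholds, and that Chamfer distance averages unsquared nearest-neighbour distances over both directions; the paper's exact conventions may differ.

```python
import math


def fleiss_kappa(counts):
    """counts[i][j] = number of raters who put item i in category j."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("every item needs the same number (>= 2) of ratings")
    n_cats = len(counts[0])
    # Share of all ratings that fall in each category.
    p_cat = [sum(row[j] for row in counts) / (n_items * n_raters)
             for j in range(n_cats)]
    # Observed agreement: fraction of agreeing rater pairs, averaged over items.
    p_obs = sum((sum(c * c for c in row) - n_raters)
                / (n_raters * (n_raters - 1)) for row in counts) / n_items
    # Agreement expected by chance from the category shares.
    p_exp = sum(p * p for p in p_cat)
    if p_exp == 1.0:
        raise ValueError("kappa is undefined when all ratings share one category")
    return (p_obs - p_exp) / (1.0 - p_exp)


def _nearest(src, dst):
    """Distance from each point in src to its nearest neighbour in dst."""
    return [min(math.dist(p, q) for q in dst) for p in src]


def chamfer_distance(pred, gt):
    """Mean nearest-neighbour distance, averaged over both directions."""
    d_pg, d_gp = _nearest(pred, gt), _nearest(gt, pred)
    return 0.5 * (sum(d_pg) / len(d_pg) + sum(d_gp) / len(d_gp))


def f_score(pred, gt, tau):
    """Harmonic mean of precision and recall at distance threshold tau."""
    d_pg, d_gp = _nearest(pred, gt), _nearest(gt, pred)
    precision = sum(d <= tau for d in d_pg) / len(d_pg)
    recall = sum(d <= tau for d in d_gp) / len(d_gp)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy data (hypothetical): two ground-truth points and two predictions, in mm.
gt = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
pred = [(0.0, 0.0, 3.0), (10.0, 0.0, 8.0)]
print(chamfer_distance(pred, gt), f_score(pred, gt, 5.0), f_score(pred, gt, 10.0))
```

The nearest-neighbour search here is brute force, which is fine for a toy example but would need a spatial index (such as a k-d tree) for dense meshes or point clouds.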

Retargeting Noisy References Is a Distinct Hard Problem — And It Requires Physics

Most prior retargeting work assumes clean MoCap ground-truth poses. This paper explicitly targets the harder, more realistic case: noisy reconstructed references with temporal discontinuities and depth ambiguity. The three innovations (warmup steps, random force perturbation, transition reward) together take retargeting success rate from 25% to 71% on reconstructed in-the-wild data.

"On our reconstructed in-the-wild data, Do as I Do reaches a 71% success rate, significantly improving over the baseline of 25%. The main differentiator is warmup, which discovers initial states that are much more stable and natural than the noisy initial frame." — Section 4.3

Crucially, these same innovations also improve performance on clean MoCap data (72% → 81% on OakInk2), suggesting the techniques generalize and aren't just patching reconstruction artifacts.

Internet Video Is Noisier Than You Think — A 20x Data Penalty If Unfiltered

The paper includes a sobering empirical audit of the 100DOH dataset, one of the most-used internet hand-object interaction datasets. Despite being pre-filtered for hand-object interaction, only 4% of sampled clips survive the full quality pipeline for robot learning readiness.

"Out of the 2,000 videos sampled from 100DOH, only 83 (4%) survive our quality check for the reconstruction pass. Even in the best case, we foresee 107 clips, or roughly 5% of the data, being directly relevant for learning dexterous manipulation, implying a 20× penalty in not properly preprocessing and filtering internet videos for robot learning." — Section 4.5

This is a direct operational warning for any team assuming raw internet video scales cleanly into training data.
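
The arithmetic behind the quoted penalty can be checked in a few lines. The sketch below uses only the audit figures quoted from Section 4.5 (2,000 sampled clips, 83 survivors, 107 best case); the helper names and the 500-clip planning target are hypothetical. Note that the quoted 20× corresponds to rounding the best-case yield to 5%; the unrounded ratios are slightly different.

```python
def yield_rate(usable, sampled):
    """Fraction of sampled clips that end up usable."""
    return usable / sampled


def clips_needed(target_usable, usable, sampled):
    """Raw clips to sample to expect `target_usable` usable ones (rounded up)."""
    return -(-target_usable * sampled // usable)  # integer ceiling division


SAMPLED = 2000    # clips sampled from 100DOH (Section 4.5)
SURVIVED = 83     # clips passing the reconstruction quality check
BEST_CASE = 107   # clips the authors foresee as relevant in the best case

for label, usable in [("survived", SURVIVED), ("best case", BEST_CASE)]:
    rate = yield_rate(usable, SAMPLED)
    per_usable = SAMPLED / usable
    print(f"{label}: yield {rate:.2%}, about {per_usable:.1f} raw clips per usable clip")

# Hypothetical planning target: 500 usable clips at the best-case yield.
print(clips_needed(500, BEST_CASE, SAMPLED))
```

At the best-case yield this works out to roughly 18.7 raw clips per usable clip, and roughly 24.1 if only the 83 survivors count, which brackets the paper's rounded 20× figure.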

Generated Video Is a Viable Third Data Source

The pipeline explicitly supports AI-generated video as an input source, and 16% of the final 500-trajectory dataset came from generative video models. This creates a closed loop: generative video can be conditioned on specific objects, environments, or motions that are hard to find in the wild, then fed directly into the reconstruction-retargeting pipeline to produce robot trajectories.

"Do as I Do reconstructs hand-object interactions from various egocentric and exocentric in-the-wild video sources... ranging from in-the-wild internet clips to outputs of generative video models." — Section 1 (Contributions)

2. Contrarian Perspectives

Teleoperation Is Not the Scalable Path for Dexterous Hands — Observation Is

The dominant industry assumption is that dexterous manipulation data requires purpose-built teleoperation hardware (exoskeletons, gloves, motion capture rigs) operated by trained specialists. This paper directly challenges that assumption by demonstrating that a consumer video of someone whisking eggs can be converted into a physically executable robot trajectory — no special hardware, no operator expertise required.

"Teleoperation is bottlenecked by operator expertise, cost of operation, and mechanical transparency of the teleoperation rig. Exploration in simulation is similarly bottlenecked by complexities in designing diverse environments and reward functions." — Section 1

The 71% retargeting success rate on reconstructed references is imperfect, but with proper filtering applied to internet-scale video, this approach could yield more diverse and cheaper data than teleoperation. The industry has not yet seriously stress-tested the teleoperation assumption.

Kinematic Retargeting Without Physics Is Structurally Broken

Several competing approaches (DexMan, DexImit, VideoManip) use kinematic solvers or motion planning for the retargeting step. This paper argues that approach is fundamentally inadequate because it ignores contact forces, causing penetration, fingertip sliding, and grasp instability.

"Kinematic retargeting approaches... operate solely on the robot configuration and do not account for forces between hand and object, often causing penetration, fingertip sliding, and grasp instability." — Section 2

The ablation data is consistent with this: the baseline (Annealed Sampling alone) achieves only 25% success on reconstructed references, and adding warmup, random force perturbation, and the transition reward raises that to 71%. Any team using kinematic-only retargeting should treat their success metrics skeptically.

Foundation Models for Reconstruction Beat Specialized Trackers in Real-World Conditions

The conventional assumption is that specialized 6-DoF pose trackers (FoundationPose, Any6D) — built specifically for this problem — should outperform generative models repurposed as trackers. The paper inverts this: SAM 3D, a generative image-to-3D model not designed for tracking, outperforms purpose-built trackers on in-the-wild video because it was trained with occlusion robustness built in.

"Prior works tend to lose pose lock, drift, or fail to re-acquire the object once visual evidence degrades... Therefore, we develop our own tracking method based on 3D generative foundation models trained with occlusions, namely SAM 3D." — Section 3.1

The practical implication: the field of specialized pose tracking may be disrupted by generative foundation models that learn richer priors over object shape and appearance.

3. Companies Identified

NVIDIA

Developer of Isaac Sim and co-developer, with Google DeepMind, of MuJoCo Warp (GPU-accelerated physics). The retargeting pipeline runs on MuJoCo Warp for parallel sampling-based optimization, so NVIDIA's GPU infrastructure is load-bearing for making the optimization tractable at scale.

"GPU-parallel physical simulators such as Mujoco Warp and Isaac enable fast sampling-based optimization algorithms that we can use to infer robotic dexterous hand actions in minutes from 4D hand-object states." — Section 1

Google DeepMind

Co-developer of MuJoCo Warp alongside NVIDIA. Its researchers also authored BootsTAPIR (point tracking), which is used for adaptive pose guidance in the reconstruction pipeline.

"Google DeepMind and NVIDIA. Mujoco Warp (MJWarp)." — Reference [13]

Meshy AI

3D model generation company whose MeshyAI product is used by competing pipelines (VideoManip). The paper implicitly positions SAM 3D as a superior alternative for object mesh generation in manipulation pipelines.

"VideoManip: MeshyAI + FPose, DRO + DP3" — Table 1

Sharpa (Sharpa Wave Hand)

The hardware platform used for all real-world deployment experiments in this paper: a 22-DoF dexterous hand deployed on a bimanual UR3e setup. The paper's real-world results directly validate Sharpa's platform for research-grade dexterous manipulation.

"Across all tasks, we use the 22-DoF Sharpa Wave hand. Real-world deployment results shown here are on a bimanual setup with Sharpa Wave hands and UR3e arms, both commanded at 50 Hz." — Section 4.1

Universal Robots (UR3e)

Arm hardware used in the bimanual real-world deployment setup: a standard collaborative robot arm integrated with the Sharpa dexterous hand.

"Real-world deployment results shown here are on a bimanual setup with Sharpa Wave hands and UR3e arms." — Section 4.1

Kyutai

Provided compute resources for this project. Relevant as an AI research infrastructure player supporting frontier robotics research.

"We thank Kyutai for providing us with the compute resources for this project." — Acknowledgments

Amazon

Pieter Abbeel holds a concurrent appointment as an Amazon Scholar. No direct product involvement is described, but the appointment signals Amazon's continued investment in dexterous manipulation research talent.

"Pieter Abbeel holds concurrent appointments as a Professor at UC Berkeley and as an Amazon Scholar at Amazon." — Acknowledgments

4. People Identified

Pieter Abbeel

UC Berkeley / Amazon Scholar. One of the most cited researchers in robot learning, with foundational work on imitation learning and reinforcement learning for robotics. His presence on this paper signals institutional weight and connects to the broader learning-from-video research agenda he has championed.

"Pieter Abbeel holds concurrent appointments as a Professor at UC Berkeley and as an Amazon Scholar at Amazon." — Acknowledgments

Jitendra Malik

UC Berkeley. Legendary computer vision researcher and co-inventor of foundational vision algorithms. His involvement in this paper, alongside the SAM 3D and HaWoR tools developed in his extended research group, reflects a deepening convergence between classical vision and physical robot learning. Malik appears on multiple works cited in this paper, including HaMeR and SAM 3D.

Referenced as co-author on HaMeR: "Pavlakos, Shan, Radosavovic, Kanazawa, Fouhey, and Malik. Reconstructing hands in 3D with transformers." — Reference [12]

Nur Muhammad "Mahi" Shafiullah

UC Berkeley. Rising figure in robot learning, known for work on generalist robot policies and learning from diverse data. Co-equal contributor on this paper, representing the robot learning perspective.

Listed as corresponding author group alongside Bhawna Paliwal — Abstract

Bhawna Paliwal, Haritheja Etukuru, William Liang

UC Berkeley, equal contributors. Etukuru and Liang hold NSF Graduate Research Fellowships. Liang has prior work on visual reward representations (LIV) and EgoZero (robot learning from smart glasses), suggesting a research program focused on learning robot skills from passive human data sources.

"Haritheja Etukuru and William Liang are supported by the NSF Graduate Research Fellowship Program under Grant DGE 2146752." — Acknowledgments; Liang cited in EgoZero [9]

Chaoyi Pan

Cited for foundational work on SPIDER (Scalable Physics-Informed Dexterous Retargeting), the state-of-the-art retargeting method that this paper builds on and extends. Pan's SPIDER framework is the baseline that Do as I Do improves upon significantly (25% → 71% success on reconstructed in-the-wild data).

"We thank Chaoyi Pan for guidance and insightful discussions on retargeting." — Acknowledgments; Pan et al. [15] is the primary retargeting baseline throughout

5. Operating Insights

Filter Your Internet Video Before You Build Anything Else

Teams sourcing training data from internet video datasets should build aggressive preprocessing pipelines before investing in reconstruction or retargeting infrastructure. The paper's audit of 100DOH shows that even a pre-filtered hand-object interaction dataset yields only 4-5% usable clips for dexterous manipulation learning. The specific failure modes identified are actionable: shots with hands or objects at frame boundaries (22% of candidates), no/cross-boundary activity (15%), camera motion (7%), and model failures (5%). A CTO building a data flywheel should design filtering as a first-class system component, not an afterthought.
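
One way to treat filtering as a first-class component is to record why each clip is rejected, so the funnel can be audited the way the paper audits 100DOH. The sketch below is a hypothetical illustration: the reason labels follow the failure modes listed above, but the metadata keys (`touches_frame_edge`, `has_usable_activity`, `camera_moving`, `reconstruction_ok`) are assumptions standing in for whatever upstream detectors a team actually runs, and the check order is arbitrary.

```python
from collections import Counter

# Ordered checks; the first one that fails becomes the rejection reason.
# Each entry is (reason label, predicate that returns True when the clip passes).
CHECKS = [
    ("frame_boundary", lambda c: not c["touches_frame_edge"]),
    ("no_or_cross_boundary_activity", lambda c: c["has_usable_activity"]),
    ("camera_motion", lambda c: not c["camera_moving"]),
    ("model_failure", lambda c: c["reconstruction_ok"]),
]


def first_failure(clip):
    """Return the label of the first failed check, or None if all pass."""
    for reason, passes in CHECKS:
        if not passes(clip):
            return reason
    return None


def filter_clips(clips):
    """Split clips into kept ones and a tally of rejection reasons."""
    kept, rejected = [], Counter()
    for clip in clips:
        reason = first_failure(clip)
        if reason is None:
            kept.append(clip)
        else:
            rejected[reason] += 1
    return kept, rejected


# Hypothetical clip metadata for illustration only.
demo_clips = [
    {"id": "a", "touches_frame_edge": False, "has_usable_activity": True,
     "camera_moving": False, "reconstruction_ok": True},
    {"id": "b", "touches_frame_edge": True, "has_usable_activity": True,
     "camera_moving": False, "reconstruction_ok": True},
    {"id": "c", "touches_frame_edge": False, "has_usable_activity": True,
     "camera_moving": True, "reconstruction_ok": True},
    {"id": "d", "touches_frame_edge": False, "has_usable_activity": True,
     "camera_moving": False, "reconstruction_ok": False},
]

kept, rejected = filter_clips(demo_clips)
print([c["id"] for c in kept], dict(rejected))
```

Putting the cheapest checks first keeps expensive steps such as reconstruction from running on clips that would be rejected anyway.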

"Out of the 2,000 videos sampled from 100DOH, only 83 (4%) survive our quality check for the reconstruction pass... implying a 20× penalty in not properly preprocessing and filtering internet videos for robot learning." — Section 4.5

Physics Simulation Quality Is Now an Upper Bound on Real-World Performance

For teams running sim-to-real transfer for dexterous manipulation, the paper explicitly identifies physics simulation fidelity as a hard ceiling on achievable real-world performance. This is not a peripheral concern — it directly limits the quality of trajectories generated from the retargeting pipeline, and by extension, the quality of any policy trained on that data. Teams should be actively tracking improvements in GPU-parallel physics engines (MuJoCo Warp, Isaac) and building evaluation frameworks that measure the sim-to-real gap for contact-rich manipulation specifically.

"The current physics simulators model the real world dynamics only approximately, which places an upper bound on the achievable real-world performance of our framework." — Section 5 (Limitations)

The Warmup Pattern Is Generalizable Infrastructure

The "warmup step" introduced for retargeting — where the object is fixed and the robot hand repositions before the main trajectory begins — is a conceptually simple but operationally powerful pattern. It prevents the optimizer from getting trapped by bad initial conditions caused by noisy reconstruction, and it does so without any task-specific heuristics or grasp priors. Any team building trajectory optimization pipelines for dexterous manipulation from imperfect references (reconstructed, predicted, or transferred) should adopt this pattern.
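
As a rough illustration of the pattern, the sketch below prepends H warmup frames to a reference trajectory and marks the object as held in place during them. The `RefFrame` type, its fields, and the `object_fixed` flag are hypothetical; the sketch only covers how the reference is extended. How the simulator pins the object and how the optimizer scores the hand during warmup are left out, since this summary does not specify them.

```python
from dataclasses import dataclass, replace
from typing import List, Tuple


@dataclass(frozen=True)
class RefFrame:
    """One reference frame (hypothetical layout)."""
    hand_pose: Tuple[float, ...]     # target hand configuration
    object_pose: Tuple[float, ...]   # target object pose
    object_fixed: bool = False       # True: the simulator holds the object in place


def prepend_warmup(reference: List[RefFrame], num_warmup: int) -> List[RefFrame]:
    """Return the reference with `num_warmup` warmup frames prepended.

    Warmup frames copy the first reference frame and mark the object as held
    in place, giving the hand time to settle before tracking begins.
    """
    if num_warmup < 0:
        raise ValueError("num_warmup must be non-negative")
    if not reference:
        raise ValueError("reference must contain at least one frame")
    first = reference[0]
    warmup = [replace(first, object_fixed=True) for _ in range(num_warmup)]
    return warmup + list(reference)


# Hypothetical two-frame reference extended with H = 3 warmup steps.
ref = [RefFrame((0.0,), (1.0, 2.0, 3.0)), RefFrame((0.1,), (1.0, 2.0, 3.5))]
extended = prepend_warmup(ref, 3)
print(len(extended), [f.object_fixed for f in extended])
```

Keeping warmup as a transformation of the reference, rather than a special case inside the optimizer, means the same tracking code can run unchanged on both warmup and regular frames.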

"We introduce additional H warmup steps prepended to the reference. During warmup, the object is held in place (e.g., in mid-air) while the robot hand is free to move... This allows the robot to adjust its pose before tracking the reference, and naturally guides the optimizer in maximizing its tracking objective. Crucially, this warmup design does not assume any grasp sampling or heuristics." — Section 3.2

6. Overlooked Insights

Generated Video as a Data Source Has Structural Advantages Over Internet Video — But Nobody Is Measuring It

The paper includes generated video as 16% of its final trajectory dataset without extensively analyzing whether generated video produces higher or lower quality trajectories than real internet video. This is an important gap. Generated video can be conditioned on specific objects, lighting, camera angles, and motions that address exactly the filtering failures observed in internet data (frame boundary issues, shot cuts, camera shake). If generative video models produce cleaner hand-object interaction than found footage — which is plausible given they can be prompted to avoid the specific failure modes — then the optimal data strategy inverts: generate first, scrape second. No one has published a controlled comparison of trajectory quality by video source type. This paper creates the infrastructure to run that experiment.

"Do as I Do reconstructs hand-object interactions from various egocentric and exocentric in-the-wild video sources... ranging from in-the-wild internet clips to outputs of generative video models." — Abstract; "internet (53%), egocentric (31%), and generated (16%) videos" — Section 4.4

The Clustering-Based Pose Selection Is 30x Faster Than the Principled Alternative — With No Quality Loss

Buried in the appendix is a finding with significant deployment implications: the paper's clustering-based heuristic for selecting the best pose candidate among N=25 samples matches the theoretically principled log-likelihood ranking approach on both DexYCB and HOI4D benchmarks, while being up to 30x faster. The principled approach requires ~8,700 forward+backward passes through the diffusion backbone per frame, making it computationally prohibitive at video scale. The clustering approach adds essentially zero cost over generation. For any team trying to run this pipeline at scale (thousands of videos), this is not a footnote — it's the difference between tractable and intractable compute costs.
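
The paper's exact clustering procedure is not described in this summary. One plausible reading, sketched below purely as an assumption, groups the N candidate poses by proximity and returns the medoid of the largest group, so no extra passes through the diffusion backbone are needed. The pose representation (translation plus unit quaternion), the distance function, `rot_weight`, and `radius` are all illustrative choices rather than the authors' settings.

```python
import math
from typing import List, Sequence, Tuple

# Pose = (translation xyz, unit quaternion wxyz); a hypothetical representation.
Pose = Tuple[Tuple[float, float, float], Tuple[float, float, float, float]]


def pose_distance(a: Pose, b: Pose, rot_weight: float = 0.05) -> float:
    """Translation distance plus a weighted rotation angle (radians)."""
    (ta, qa), (tb, qb) = a, b
    dot = abs(sum(x * y for x, y in zip(qa, qb)))
    angle = 2.0 * math.acos(min(1.0, dot))
    return math.dist(ta, tb) + rot_weight * angle


def select_pose(candidates: Sequence[Pose], radius: float) -> int:
    """Index of the medoid of the largest proximity cluster."""
    n = len(candidates)
    if n == 0:
        raise ValueError("need at least one candidate")
    d = [[pose_distance(candidates[i], candidates[j]) for j in range(n)]
         for i in range(n)]

    # Single-linkage clustering via union-find: link poses closer than `radius`.
    parent = list(range(n))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if d[i][j] <= radius:
                parent[find(i)] = find(j)

    clusters: dict = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    largest: List[int] = max(clusters.values(), key=len)

    # Medoid: the member with the smallest total distance to its cluster.
    return min(largest, key=lambda i: sum(d[i][j] for j in largest))


# Hypothetical candidates: three mutually consistent poses and one outlier.
IDENTITY = (1.0, 0.0, 0.0, 0.0)
cands = [((0.00, 0.0, 0.0), IDENTITY), ((0.01, 0.0, 0.0), IDENTITY),
         ((0.02, 0.0, 0.0), IDENTITY), ((1.00, 0.0, 0.0), IDENTITY)]
print(select_pose(cands, radius=0.05))
```

For N=25 candidates this needs only a few hundred pairwise distance evaluations per frame, which is consistent with the summary's point that selection adds almost no cost on top of generation.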

"Clustering-based selection performs on par with pose-likelihood selection while being up to 30× faster." — Section 3.1; "Evaluating the trace exactly requires... roughly two orders of magnitude above generation itself and prohibitive at video scale. As a result, we go with a clustering based heuristic which is almost real-time once the candidates have been generated." — Appendix A