Sumo: Dynamic and… | arXiv Physical AI Research Summary

1. Key Themes

Beating Physics Limits with Whole-Body Coordination

The headline result is a Spot quadruped uprighting a 15 kg tire, heavier than the robot's rated 11 kg arm payload, in 10 out of 10 trials with an average completion time under 10 seconds. This is not a marginal improvement over spec; it is a qualitatively different capability class. As the paper states: "Sumo enables the robot to complete the task 10 out of 10 trials with an average completion time of 9.2 ± 4.7 s" (Section V-A, Tire Upright). The key insight is that the robot coordinates its arm, torso, and legs, rather than using the arm in isolation, to move a load the arm alone is not rated to lift.

Hierarchical Control as the Unlock for Generalization Without Retraining

The paper's core architectural bet is that separating "how to move" (RL-trained low-level policy, fixed) from "what to do" (online sample-based MPC, updated in real time) enables generalization that neither approach achieves alone. Critically, this works zero-shot: "our method generalizes to a diverse set of objects and tasks with no additional tuning or training" (Abstract). In simulation benchmarks, Sumo achieves ≥80% success across all five tested objects, while end-to-end RL degrades sharply on complex geometries like tires and tire racks (Figure 4, Section IV-B).
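The split between a fixed low-level policy and an online planner can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the paper's implementation: the 1-D "robot", the first-order `low_level_policy`, the two-term cost, and every function name here are hypothetical. The planner shown is plain predictive sampling (perturb the current plan, keep the cheapest rollout), which is one simple member of the sampling-based MPC family.

```python
import random

DT = 0.05  # 20 Hz control step (assumed for the toy)


def low_level_policy(velocity, command, gain=0.5):
    """Stand-in for a fixed, pre-trained policy: tracks a velocity command with lag."""
    return velocity + gain * (command - velocity)


def step(state, command):
    """One step of the closed-loop system (policy plus toy dynamics)."""
    pos, vel = state
    vel = low_level_policy(vel, command)
    return (pos + vel * DT, vel)


def rollout_cost(state, commands, goal):
    """Policy-in-the-loop rollout: squared goal distance plus a small effort term."""
    cost = 0.0
    for u in commands:
        state = step(state, u)
        cost += (state[0] - goal) ** 2 + 1e-3 * u ** 2
    return cost


def plan(state, nominal, goal, rng, num_samples=32, noise=0.3, u_max=1.0):
    """Predictive sampling: perturb the nominal plan, keep the cheapest candidate."""
    best, best_cost = nominal, rollout_cost(state, nominal, goal)
    for _ in range(num_samples - 1):
        cand = [max(-u_max, min(u_max, u + rng.gauss(0.0, noise))) for u in nominal]
        cost = rollout_cost(state, cand, goal)
        if cost < best_cost:
            best, best_cost = cand, cost
    return best


def run_mpc(goal=1.0, horizon=30, steps=200, seed=0):
    """Replan every step, execute only the first command, then warm-start by shifting."""
    rng = random.Random(seed)
    state = (0.0, 0.0)
    nominal = [0.0] * horizon  # 30 steps * 0.05 s = 1.5 s horizon
    for _ in range(steps):
        nominal = plan(state, nominal, goal, rng)
        state = step(state, nominal[0])
        nominal = nominal[1:] + [nominal[-1]]
    return state


if __name__ == "__main__":
    pos, vel = run_mpc()
    print(f"final position {pos:.3f}, final velocity {vel:.3f}")
```

The planner never touches the low-level policy's internals; it only searches over the commands that policy accepts. That is why, in this sketch, changing the task means changing `rollout_cost` rather than retraining anything.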

Order-of-Magnitude Faster Iteration Than RL-Based Approaches

The compute efficiency story is strategically important. The paper reports: "Both methods achieve similar asymptotic performance, but Sumo reaches the same performance with an order-of-magnitude less compute time" compared to hierarchical RL (Section IV-D, Figure 6). RL tuning is measured in GPU-hours on an NVIDIA RTX A6000; Sumo's hyperparameter search runs on a desktop CPU. For teams deploying robots in dynamic real-world environments where task requirements change frequently, that gap decides whether adapting to a new task is a quick tuning pass on a desktop or a wait on GPU training that takes roughly ten times as long.

Real-World Task Diversity Across Eight Scenarios at High Success Rates

Unlike most manipulation papers, which demonstrate one or two tasks, Sumo reports 8 distinct real-world Spot tasks with success rates ranging from 8/10 to 10/10: uprighting tires, barriers, cones, and chairs; stacking tires; dragging barriers and tire racks; and pushing a heavy box. Objects span 3.5 kg to 20 kg, with highly variable geometry and friction. "These tasks stress three recurring difficulties: objects that are larger than the robot or heavier than the arm payload, contact conditions and geometries that create large sim-to-real gaps, and distinct manipulation modes such as uprighting, stacking, dragging, and pushing" (Section V-A).

Cross-Embodiment Transfer of the Same Framework

The identical framework transfers to a Unitree G1 humanoid in simulation with minimal modification, using an off-the-shelf locomotion policy not trained for manipulation. "While this policy is not explicitly trained for loco-manipulation and does not take in the desired arm commands, we find that simply overriding the arm commands with target commands from the high-level sampling-based MPC is also effective" (Section III-B). Four humanoid tasks (box pushing, chair pushing, door opening, table pushing) achieve 8–10/10 success. This suggests the architecture is not Spot-specific, although the humanoid results are so far limited to simulation.

2. Contrarian Perspectives

End-to-End RL for Manipulation Is Hitting a Geometry Wall

Conventional wisdom in physical AI, reinforced by high-profile demos from labs like CMU, Berkeley, and Stanford, is that sufficiently trained RL policies with domain randomization will generalize to new objects. Sumo challenges this directly. Their E2E RL baseline was trained with PPO across 4,096 parallel environments for 5,000 iterations (roughly 2 GPU-hours) with 15 reward terms and size/weight/friction randomization. The hierarchical RL baseline shows the same weakness outside its training distribution: "a hierarchical RL policy trained to move a 1.5 kg box to a goal does not generalize reliably once object geometry and weight move beyond its training distribution, even after randomizing the box size, weight, and friction" (Section IV-C). Spot RL achieves 100% on the training box but degrades sharply on tires and tire racks. The implication: for tasks involving geometrically complex or heavy objects, more RL training may not be the right answer.

You Don't Need Foundation Models or Human Demonstrations for Hard Manipulation Tasks

The dominant narrative in robotics right now is that scaling imitation learning from human demonstrations (diffusion policy, ACT, etc.) is the path to generalist manipulation. SUMO never uses a single human demonstration. It instead argues: "research on loco-manipulation has largely focused on learning from human demonstrations through teleoperation or video imitation, which are usually limited to quasi-static table-top settings and fails to leverage the passive dynamics of the object or the robot" (Section I). SUMO's approach — a fixed locomotion RL policy plus online planning — achieves 80–100% success on physically demanding tasks that imitation learning has not demonstrated at comparable scale or object diversity.

Reward Engineering Is the Bottleneck, Not Data — and MPC Bypasses It

The standard response to RL's generalization failures is "better reward shaping" or "LLM-generated rewards." SUMO shows this is often unnecessary when you switch paradigms. Compared to E2E RL's 15 reward terms per task (requiring expert iteration), SUMO uses only 3 reward terms to achieve comparable or better performance: "Sumo achieves similar or better success with only 3 reward terms and no task-specific retraining or tuning, whereas E2E RL requires 15 reward terms and about 2 hours of GPU compute per task" (Section IV-B). The contrarian implication: the reward engineering loop that consumes so much robotics engineering time may be a symptom of using RL where online planning would be simpler.

3. Companies Identified

Boston Dynamics

Description: Manufacturer of the Spot quadruped robot used as the primary hardware platform throughout the paper
Why relevant: Sumo is a direct capability expansion for Spot deployments. The paper demonstrates that Spot can manipulate objects well beyond its rated 11 kg arm payload and handle tasks (tire stacking, barrier dragging) not previously demonstrated autonomously. Boston Dynamics' customers in construction, public safety, and industrial inspection will care about these capabilities. The paper also reveals a platform limitation: state estimation depends on MoCap, which currently restricts deployment to lab settings.
Quote: "We demonstrate the capabilities of our approach through a variety of challenging loco-manipulation tasks on a Spot quadruped robot in the real world, including uprighting a tire heavier than the robot's nominal lifting capacity of 11 kg" (Section V-A)

Unitree Robotics

Description: Manufacturer of the G1 humanoid robot used for simulation experiments
Why relevant: SUMO demonstrates that the framework transfers to the G1 using only the standard off-the-shelf velocity-tracking locomotion policy from MJLab — no custom training required. This is a strong signal for G1 operators: they may be closer to dynamic loco-manipulation capability than they realize, using existing policy infrastructure.
Quote: "For the G1 humanoid robot, we use the standard velocity-tracking policy from MJLab that takes in the desired torso velocity commands. While this policy is not explicitly trained for loco-manipulation... we find that simply overriding the arm commands with target commands from the high-level sampling-based MPC is also effective" (Section III-B)

RAI Institute (Robotics and AI Institute)

Description: The primary institutional home of the research; an applied robotics AI research organization
Why relevant: This paper is a direct output of RAI Institute, and the author list (17 researchers) suggests significant institutional investment in the loco-manipulation problem. The institute is positioning itself at the intersection of sim-to-real RL and online planning, a technically differentiated space. The project is hosted at sumo.rai-inst.com, and the code and benchmarks are open-sourced, signaling intent to build community around this approach.
Quote: "This work was done in part during an internship at the RAI institute" (Acknowledgments)

Google DeepMind / NVIDIA (MuJoCo Warp)

Description: Joint developers of MuJoCo Warp, a GPU-accelerated physics simulator referenced in the paper
Why relevant: Sumo runs its policy-in-the-loop rollouts on CPU MuJoCo (not GPU), taking 43 ms for 32 parallel rollouts over a 1.5-second horizon. That fits inside the 50 ms cycle of 20 Hz control. The existence of GPU-accelerated MuJoCo Warp suggests the next generation of this approach could run significantly faster, enabling longer planning horizons or more rollouts.
Quote: "MuJoCo warp: gpu-optimized version of the mujoco physics simulator" (Reference [10])
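The timing claim in this entry reduces to simple arithmetic, worked out below as a small Python helper. The function name and dictionary keys are hypothetical, and the throughput figure assumes that "over 1.5 seconds" means each of the 32 rollouts simulates 1.5 seconds; the summary does not state this explicitly.

```python
def control_budget(control_hz, planning_ms, num_rollouts, horizon_s):
    """Compare planning time against the control period and estimate sim throughput."""
    period_ms = 1000.0 / control_hz
    simulated_s = num_rollouts * horizon_s  # assumes every rollout spans the full horizon
    return {
        "period_ms": period_ms,
        "slack_ms": period_ms - planning_ms,
        "fits": planning_ms <= period_ms,
        "sim_seconds_per_wall_second": simulated_s / (planning_ms / 1000.0),
    }


if __name__ == "__main__":
    budget = control_budget(control_hz=20, planning_ms=43, num_rollouts=32, horizon_s=1.5)
    for key, value in budget.items():
        print(key, value)
```

With the reported numbers, the control period is 50 ms and planning leaves 7 ms of slack per cycle. Any speedup from a faster simulator could be spent on more rollouts or a longer horizon within the same period.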

4. People Identified

John Z. Zhang

Lab/Institution: Carnegie Mellon University (Robotics Institute) + RAI Institute
Why notable: Lead author; has a track record spanning sim-to-real locomotion, motion imitation (SLoMo), and contact-rich MPC. His prior work on whole-body MPC with MuJoCo (Reference [47]) directly enables SUMO's policy-in-the-loop architecture. Zhang is one of the more technically broad researchers in the loco-manipulation space — comfortable at the intersection of optimization, RL, and hardware.
Quote: First author on the paper; prior cited work includes "Whole-body model-predictive control of legged robots with mujoco" and "Real-time whole-body control of legged robots with model-predictive path integral control" (References [1, 47])

Maks Sorokin, Jan Brüdigam, Brandon Hung

Lab/Institution: RAI Institute (co-equal first authors, denoted with *)
Why notable: The three starred co-authors represent the core implementation team. Brüdigam's background in trajectory optimization and Hung's prior work on sampling-based MPC (JUDO package, Reference [22]) are directly reflected in the system architecture. JUDO is cited as the open-source MPC package underlying this work.
Quote: "JUDO: a user-friendly open-source package for sampling-based model predictive control" (Reference [22], co-authored by Hung)

Xinghao Zhu

Lab/Institution: RAI Institute
Why notable: Lead author of ReLIC (Reference [50]), the whole-body control policy that serves as Sumo's low-level controller for Spot. ReLIC's multi-limb coordination, which enables a stable three-legged gait with the fourth leg used as a manipulator, is what makes Sumo's tire uprighting possible. Without ReLIC, the high-level planner would have a less capable substrate.
Quote: "We use the Relic policy which is designed for multi-limb loco-manipulation on the Spot quadruped robot. In particular, the Relic policy enables stable gaits with only three legs and allows the robot to use the fourth leg as a manipulator in addition to its arm and torso" (Section III-B)

Zachary Manchester

Lab/Institution: Carnegie Mellon University (Robotic Exploration Lab)
Why notable: Senior academic collaborator; leads CMU's work on contact-implicit MPC and trajectory optimization. His group's prior work (Fast Contact-Implicit MPC, Reference [9]; Trajectory Bundle Method, Reference [41]) provides the theoretical backbone for SUMO's planning layer. Manchester's lab is one of the few producing work that bridges mathematical rigor in optimal control with real hardware results.
Quote: Referenced in foundational prior work: "Fast Contact-Implicit Model-Predictive Control" (Reference [9])

Simon Le Cléac'h

Lab/Institution: RAI Institute
Why notable: Senior author and likely technical lead at RAI; co-author on Fast Contact-Implicit MPC and the MPPI-based whole-body control work that precedes SUMO. Serves as the bridge between the academic trajectory optimization community and deployment at RAI.
Quote: Co-author on "Real-time whole-body control of legged robots with model-predictive path integral control" (Reference [1]) and cited on fast contact-implicit MPC (Reference [9])

5. Operating Insights

The Right Architecture for Dynamic Tasks: Fix the Low Level, Plan the High Level

For engineering teams building manipulation systems on legged platforms, the key operational takeaway is architectural: invest heavily in a robust, generalist low-level locomotion policy once, then build all task-specific behavior on top through online planning rather than per-task retraining. The paper demonstrates this pays off in three ways simultaneously — reduced action space dimensionality, stabilized dynamics for planning, and simpler cost functions. "Planning in the command space of a pre-trained whole-body policy reduces the effective search space and stabilizes the robot's contact-rich dynamics... Sumo achieves similar or better success with only 3 reward terms and no task-specific retraining or tuning, whereas E2E RL requires 15 reward terms and about 2 hours of GPU compute per task" (Section IV-B). The operational implication: your per-task deployment cost drops dramatically once the low-level policy is solid.

State Estimation Is the Real Deployment Blocker, Not the Algorithm

The paper's honest limitation disclosure reveals where the engineering gap actually lives in 2025-2026 loco-manipulation systems: "We rely on a MoCap system for robot and object state estimation and are restricted to laboratory settings. Future work should explore incorporating fully onboard perception systems" (Section VI). Every impressive result in this paper — 10/10 tire uprighting, 9/10 barrier dragging — is conditioned on motion capture providing ground-truth object pose at 120 Hz. For operators evaluating commercial deployability, this is the critical gap: the planning and control stack is arguably solved; robust onboard 6-DoF object tracking in unstructured environments is not. Teams investing in perception pipelines for object state estimation (pose estimation + tracking) are building directly toward unlocking these capabilities at scale.

6. Overlooked Insights

The Sim-to-Real Gap Is Task-Dependent in Non-Obvious Ways — and MPC Manages It Better Than RL

Buried in the hardware results is a practically important observation about which physical properties break simulation fidelity most severely. The paper notes for tire stacking: "the friction coefficient between the rubber tires is high, which current simulators struggle to simulate accurately, causing an even larger sim-to-real gap" (Section V-A, Tire Stack). Yet Sumo still achieves 8/10 success. The reason is mechanistic: because the MPC replans at 20 Hz from the measured robot and object state (from MoCap), it continuously corrects for model errors instead of committing to behavior fixed offline in an imperfect simulation. The paper makes a related point about iteration speed: "We believe real-time MPC as the high-level policy significantly accelerates this tuning loop compared to RL training" (Section VI). The overlooked implication for operators: for tasks with large, hard-to-model contact dynamics (rubber, deformable objects, wet surfaces), online replanning may be fundamentally more robust than any amount of domain randomization in offline RL training.
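The claim that replanning from measured state absorbs model error can be illustrated with a deliberately small Python example. Everything in it is an assumption made for illustration: a 1-D object that is pushed by a velocity command, a planner model that believes the full command is realized (`model_gain=1.0`), and a "real" plant in which slip means only 60% is realized (`real_gain=0.6`). It is not the paper's planner or dynamics.

```python
def plan_velocities(x, goal, steps_left, dt, model_gain):
    """Plan with the (possibly wrong) model x' = x + model_gain * v * dt:
    a constant velocity that would reach the goal in steps_left steps."""
    v = (goal - x) / (model_gain * steps_left * dt)
    return [v] * steps_left


def real_step(x, v, dt, real_gain):
    """'Real' plant: slip means only a fraction of the commanded velocity is realized."""
    return x + real_gain * v * dt


def open_loop(goal, n, dt, model_gain, real_gain):
    """Plan once at the start and execute the whole plan without feedback."""
    x = 0.0
    for v in plan_velocities(x, goal, n, dt, model_gain):
        x = real_step(x, v, dt, real_gain)
    return x


def replanning(goal, n, dt, model_gain, real_gain):
    """Replan from the measured position every step; execute only the first command."""
    x = 0.0
    for k in range(n):
        v = plan_velocities(x, goal, n - k, dt, model_gain)[0]
        x = real_step(x, v, dt, real_gain)
    return x


if __name__ == "__main__":
    args = dict(goal=1.0, n=40, dt=0.05, model_gain=1.0, real_gain=0.6)
    print("open-loop final position: ", open_loop(**args))
    print("replanning final position:", replanning(**args))
```

Both controllers use the same wrong model. The open-loop run covers only 60% of the distance, because the slip error is never observed. The replanning run ends at roughly 0.95 of the goal, because each new plan starts from where the object actually is.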

The Open-Source Benchmark Could Become the Standard for Heavy-Object Manipulation Evaluation

The paper contributes "an open-source benchmark and dataset focusing on loco-manipulation tasks with objects that require whole-body coordination due to their size and weight relative to the robots' physical limits" (Section I). This is mentioned briefly but is strategically significant. Current manipulation benchmarks (RLBench, MetaWorld, DexGraspNet) overwhelmingly focus on tabletop, quasi-static, or pick-and-place scenarios with lightweight objects. A benchmark explicitly designed around objects that exceed single-limb payload limits — and requiring dynamic whole-body coordination — would fill a genuine gap and could anchor evaluation standards for the next generation of industrial and field robotics deployments. Teams building in this space should engage with this benchmark early; being early contributors to a benchmark standard has historically conferred outsized influence on how a subfield evolves.