OpenHLM: An Empirical… | arXiv Physical AI Research Summary

1. Key Themes

Whole-Body Native Control Unlocks True Humanoid Capabilities

The paper demonstrates that decoupling upper and lower body controllers—treating the humanoid as a "wheeled dual-arm platform"—severely limits the robot's capabilities. By using a joint-based whole-body teleoperation interface, the robot can perform tasks that are structurally impossible for decoupled systems, such as squatting to reach low shelves or using a foot to press a pedal. As stated in Section 3.1, "Joint-based whole-body teleoperation is the only interface that completes all three tasks, reaching 80%–87% task progress at 10–12 footsteps per rollout," whereas decoupled control and VR 3-point teleoperation failed by construction on tasks requiring lower-body manipulation (Table 1).

Cross-Embodiment Transfer from Non-Humanoid Robots is Highly Effective

A major finding is that a Vision-Language-Action (VLA) model pretrained on static and wheeled dual-arm platforms (specifically Physical Intelligence's π0.5) transfers surprisingly well to a humanoid's full action space. The authors found that "π0.5 reaches 91% average task progress, PaliGemma drops to 60%, and random initialization collapses to 42%" (Section 3.2, Figure 5). This suggests that the manipulation priors learned on simpler robot platforms are robust enough to bootstrap whole-body humanoid control, reducing the immediate need for massive humanoid-specific pretraining datasets.

Heterogeneous Co-Training Enables Cheap Data Scaling

The paper introduces a highly practical data strategy: using expensive whole-body teleoperation to teach core motions, and cheaper data sources (stationary teleop or robot-free HuMI rigs) to teach new objects and language instructions. The authors show that "co-training with cheaper manipulation-only sources... extends the policy to new objects and instructions without additional whole-body teleop" (Section 1). Specifically, on held-out tasks, co-training lifted task progress from 33% to 87%, nearly matching a 94% oracle that used full teleoperation on all tasks (Section 3.3, Figure 7).
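One way to picture the co-training recipe is as a weighted sampler that draws each training example from one of several data sources. The sketch below is only an illustration: the source names, the mixing weights, and the `sample_sources` helper are assumptions, since the summary does not report the ratios the authors used.

```python
import random
from collections import Counter

# Hypothetical mixing weights; the actual ratios are not given in this summary.
SOURCE_WEIGHTS = {
    "whole_body_teleop": 0.5,   # expensive, teaches core whole-body motions
    "stationary_teleop": 0.25,  # cheaper, manipulation-only
    "humi": 0.25,               # robot-free rig, manipulation-only
}

def sample_sources(weights, batch_size, rng):
    """Pick a data source for each example in a batch, proportional to its weight."""
    names = list(weights)
    probs = [weights[name] for name in names]
    return rng.choices(names, weights=probs, k=batch_size)

rng = random.Random(0)  # fixed seed so the draw is reproducible
batch = sample_sources(SOURCE_WEIGHTS, batch_size=1000, rng=rng)
counts = Counter(batch)
```

In a real pipeline each sampled name would index into that source's dataset. The point here is only that cheap sources can widen object and instruction coverage without adding whole-body teleoperation data.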

Outperforming State-of-the-Art with Less Than Half the Data

In a long-horizon system-level comparison, OpenHLM achieved 87.5% task progress, outperforming NVIDIA's GR00T N1.6 (57.5%) and another baseline VLA (48.8%). Crucially, OpenHLM achieved this using "less than half the total demonstration time" (Abstract, Section 4, Table 2). This suggests that careful system design and data curation can beat brute-force data scaling, at least in this comparison.

2. Contrarian Perspectives

Mixing Humanoid Data into Pretraining is Not a Silver Bullet

The conventional wisdom in the industry is that to build a strong humanoid VLA, you must include humanoid data during the pretraining phase. This paper challenges that directly. Both baseline models (GR00T N1.6 and Ψ0) included Unitree G1 humanoid demonstrations in their pretraining, while OpenHLM's backbone (π0.5) did not. Yet, OpenHLM won decisively. The authors argue that "mixing humanoid data into pretraining is not enough on its own; building a strong humanoid VLA is a question of design details" (Section 4).

Action MSE is a Misleading Metric for Robot Performance

In robotics ML, action Mean Squared Error (MSE) on a validation set is frequently used as a proxy for how well a policy will perform. In this paper's experiments, MSE was a poor predictor of real-world success. The authors note that "action MSE on held-out validation is virtually indistinguishable between the π0.5- and PaliGemma-initialized models throughout fine-tuning. However, on the robot they diverge sharply" (Section 3.2). They observed a similar pattern with single-step inference, where models with lower MSE performed roughly 20 task-progress points worse on the robot. This implies that teams should not over-index on offline validation metrics when evaluating VLA architectures.

Single-Step Inference is Currently a Trap for VLA Models

There is a strong push in the AI community to replace multi-step diffusion or flow-matching with single-step inference to reduce latency. The paper tested both one-step flow matching and a drifting model, finding that while they cut inference latency, "both underperform the baseline by roughly 20 task-progress points" (Section 3.2). The authors hypothesize the single-step actions are "jitterier and less temporally smooth on the robot," suggesting that the temporal coherence provided by multi-step sampling is critical for physical control, even if it costs an extra 30 ms.
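A toy calculation shows how per-step action MSE can hide the kind of jitter the authors hypothesize. The sketch uses a made-up one-dimensional action sequence, not the paper's data, and the `jitter` metric (mean squared difference between consecutive actions) is an assumption chosen for illustration rather than a measure the paper reports.

```python
def mse(pred, target):
    """Mean squared error between predicted and reference actions."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)

def jitter(actions):
    """Mean squared difference between consecutive actions (lower is smoother)."""
    diffs = [b - a for a, b in zip(actions, actions[1:])]
    return sum(d * d for d in diffs) / len(diffs)

# Made-up 1-D reference trajectory and two hypothetical predictions.
target = [0.0] * 8
smooth_pred = [0.1] * 8          # constant offset from the reference
jittery_pred = [0.1, -0.1] * 4   # same error magnitude, alternating sign

mse_smooth = mse(smooth_pred, target)
mse_jittery = mse(jittery_pred, target)
jitter_smooth = jitter(smooth_pred)
jitter_jittery = jitter(jittery_pred)
```

Both predictions have the same MSE against the reference, yet only the second oscillates from step to step. An offline metric that scores each timestep independently cannot tell them apart, which is one plausible reason validation MSE and on-robot results can disagree.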

3. Companies Identified

Physical Intelligence

Description: AI robotics company behind the π0 and π0.5 VLA models.

Why relevant: OpenHLM uses π0.5 as its foundational backbone. The paper's results indicate that Physical Intelligence's manipulation pretraining transfers well to humanoid platforms, supporting their cross-embodiment strategy.

Quote: "π0.5’s manipulation prior, especially the closed-loop 'see error, correct, retry' behavior implicit in its pretraining data, transfers despite the embodiment gap." (Section 3.2)

NVIDIA

Description: AI hardware and software giant and developer of the GR00T humanoid platform.

Why relevant: NVIDIA's GR00T N1.6 was used as a primary baseline. Despite including humanoid data in its pretraining, it was outperformed by OpenHLM, suggesting that current foundation models still have architectural or data design gaps to close.

Quote: "GR00T N1.6... exhibits weak grasping and fail to track language-specified targets, despite including humanoid data in their pretraining." (Abstract, Section 4)

Unitree

Description: Robot hardware manufacturer.

Why relevant: The Unitree G1 humanoid robot is the physical platform used for all experiments and data collection in this paper.

Quote: "All our loco-manipulation tasks are carried out by a Unitree G1 robot..." (Appendix B.1)

Figure AI

Description: Humanoid robotics company.

Why relevant: Referenced in the context of adopting a two-level hierarchical control framework for whole-body loco-manipulation, in line with industry-standard architectures.

Quote: "...in line with recent humanoid loco-manipulation stacks [39, 45, 46, 15]." (Section 3.1, citing Figure AI's Helix 02)

4. People Identified

Yingdong Hu

Lab/Institution: Tsinghua University / Shanghai Qi Zhi Institute / Spirit AI

Why notable: Lead author who led the manuscript and high-level VLA design. His work demonstrates a clear, pragmatic roadmap for building humanoid control systems without requiring massive proprietary data moats.

Quote: "Led the manuscript outline and high-level policy (VLA) design..." (Core Contributors section)

Yang Gao

Lab/Institution: Tsinghua University / Shanghai Qi Zhi Institute / Spirit AI

Why notable: Corresponding author and likely principal investigator. The research is backed by the Spirit AI Innovation Program, indicating active industry-academia collaboration in China focused on embodied AI.

Quote: Listed as corresponding author and core contributor at Tsinghua University and Spirit AI.

5. Operating Insights

Teleoperation Interface Design Dictates Policy Capability

If you are building a data collection pipeline for humanoids, do not use decoupled control or sparse VR 3-point tracking if you want whole-body behaviors. The interface you use to collect data creates a hard ceiling on what the VLA can learn. The paper found that joint-based whole-body teleoperation (using a PICO VR rig with body trackers) is the only way to capture the degrees of freedom necessary for tasks like squatting or using feet as manipulators. As noted in Section 3.1, "interfaces exposing only a subset of the humanoid’s degrees of freedom make certain tasks unreachable by construction."

Tier Your Data Collection by Task Complexity

CTOs should structure their data acquisition strategies hierarchically. Use expensive, full-body teleoperation (which costs ~1.5 hours per task) only to teach the robot new motions. Once the motion is learned, use cheaper, robot-free data collection methods (like HuMI, which takes only 7 minutes per task) to teach the robot new objects and language instructions. The paper's results support this hybrid approach while marking its limit: "HuMI co-training delivers new semantic understanding but not new motions" (Section 3.3).
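The collection costs quoted above make the trade-off easy to estimate. The following back-of-the-envelope sketch takes the two per-task figures from this summary (about 90 minutes for whole-body teleoperation, 7 minutes for HuMI). The task counts and the `collection_minutes` helper are hypothetical, and the calculation assumes every semantic task reuses a motion that teleoperation has already taught.

```python
TELEOP_MIN_PER_TASK = 90.0  # ~1.5 hours per task, as quoted in the summary
HUMI_MIN_PER_TASK = 7.0     # 7 minutes per task, as quoted in the summary

def collection_minutes(n_motion_tasks, n_semantic_tasks):
    """Compare a tiered plan against teleoperating every task.

    n_motion_tasks: tasks that introduce a new motion (need whole-body teleop).
    n_semantic_tasks: tasks that only add objects or instructions (HuMI suffices,
        assuming the underlying motion is already covered).
    Returns (tiered_minutes, teleop_only_minutes).
    """
    tiered = n_motion_tasks * TELEOP_MIN_PER_TASK + n_semantic_tasks * HUMI_MIN_PER_TASK
    teleop_only = (n_motion_tasks + n_semantic_tasks) * TELEOP_MIN_PER_TASK
    return tiered, teleop_only

# Hypothetical plan: 3 new motions, 10 object/instruction variations.
tiered, teleop_only = collection_minutes(3, 10)
savings_ratio = teleop_only / tiered
```

For this hypothetical plan the tiered strategy needs 340 minutes against 1170 minutes for teleoperating everything, roughly a 3.4x reduction in collection time.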

40 Demonstrations is a Highly Efficient Sweet Spot

For teams worried about the data scaling laws in robotics, this paper provides a highly encouraging data point. For whole-body loco-manipulation tasks, the returns on teleoperation data flatten quickly. The authors found that "the largest jump comes between 10 and 20 demos per task; returns flatten thereafter, reaching roughly 90% at 40 demos" (Section 3.2, Figure 6). A skilled operator can collect this in just 1.5 hours, meaning a robust multi-task policy can be built with a surprisingly small budget of high-quality data.

6. Overlooked Insights

Low-Level Controller Latency Requires Careful Tuning

When using a motion-tracking controller as the low-level interface, the "future-frame preview latency" (how far ahead the controller looks) is a critical, easily overlooked parameter. If it's zero, the robot stutters during locomotion. If it's too high (0.6 s), the operator gets overwhelmed by delay and task progress collapses to 13%. The authors found that "Δt = 0.2 s strikes the best balance," yielding smooth locomotion without unmanageable lag (Section 3.1, Figure 4). Teams deploying similar hierarchical stacks must tune this parameter during data collection, as it directly affects demonstration quality.
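One plausible reading of this trade-off is that the controller can only "look ahead" at frames it has already received, so a preview window of Δt means the frame being tracked is Δt old. The sketch below illustrates that reading with a small buffer. The `PreviewBuffer` class and the 50 Hz command rate are assumptions for illustration and are not taken from the paper.

```python
from collections import deque

class PreviewBuffer:
    """Hold streamed teleop frames so the tracker can preview future frames.

    The oldest buffered frame is treated as the current tracking target;
    every newer frame is its preview. The price is a delay of preview_dt
    seconds between the operator's motion and the robot's response.
    """

    def __init__(self, preview_dt, fps):
        self.delay_frames = round(preview_dt * fps)
        self.frames = deque(maxlen=self.delay_frames + 1)

    def push(self, frame):
        """Add the newest operator frame, discarding the oldest if full."""
        self.frames.append(frame)

    def current_and_preview(self):
        """Return (frame to track now, list of already-received future frames)."""
        frames = list(self.frames)
        return frames[0], frames[1:]

# Hypothetical 50 Hz command stream with the 0.2 s preview the authors favor.
buf = PreviewBuffer(preview_dt=0.2, fps=50)
for frame_id in range(20):
    buf.push(frame_id)
current, preview = buf.current_and_preview()

# With zero preview the tracker sees only the latest frame and no future context.
no_preview = PreviewBuffer(preview_dt=0.0, fps=50)
no_preview.push(19)
current_np, preview_np = no_preview.current_and_preview()
```

Under these assumptions, Δt = 0.2 s gives the tracker ten frames of future context at the cost of a ten-frame lag, while Δt = 0 removes the lag but leaves nothing to preview. That matches the stutter-versus-delay tension described above.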

SMPL Representations Underperform Robot Joint Space

It is tempting to use human motion capture formats like SMPL (Skinned Multi-Person Linear Model) as the action space for humanoids, as it theoretically skips a retargeting step. However, the paper found that training a VLA on an 81-dimensional SMPL action space resulted in worse performance (75% task progress) than retargeting to the robot's native 32-dimensional joint space (88%). The authors attribute this to the "much higher action-space dimensionality... SMPL’s extra dimensions are largely redundant given the body’s kinematic chain, yet the VLA must still learn to coordinate them all" (Section 3.1, Figure 3).