Simulator Adaptation for Sim-to-Real Learning of Legged Locomotion via Proprioceptive Distribution Matching
- 01 Eliminating Motion Capture as a Requirement for Simulator Calibration
- 02 Five Minutes of Hardware Data Is Enough to Meaningfully Close the Sim-to-Real Gap
- 03 Residual Actuator Models Outperform Static Parameter Tuning for Complex Dynamics
- 04 Wasserstein Distance Outperforms Naive L2 Matching Under Real-World Noise
- 05 The Framework Generalizes Beyond Its Training Distribution
Oregon State University | Dao & Fern | arXiv 2604.11090 | April 2026
1. Key Themes
Eliminating Motion Capture as a Requirement for Simulator Calibration
The central contribution is a sim-to-real calibration method that works with only onboard joint sensors — no motion capture, no external tracking, no precisely synchronized initial conditions. The authors replace trajectory-aligned state matching with distributional comparison of joint positions, velocities, and actions. As stated in the abstract: "Our approach matches the parameter recovery and policy-performance gains of privileged state-matching baselines across extensive sim-to-sim ablations on the Go2 quadruped." For hardware teams, this removes a major infrastructure bottleneck that has historically kept rigorous simulator tuning in the lab.
Five Minutes of Hardware Data Is Enough to Meaningfully Close the Sim-to-Real Gap
The paper demonstrates that a residual actuator model trained on fewer than 64 four-second rollouts (roughly four minutes of hardware execution) produces substantial drift reduction on real hardware. From the conclusions: "We then showed the ease and practicality of our process by fixing sim-to-real gaps without motion capture and less than 5 minutes of real world data." In the bipedal walking experiment (Table IV), lateral drift dropped from 1.2 m to 0.2 m, a 6x reduction, after this minimal data collection. A data requirement this small means simulator recalibration can fit inside a normal deployment cycle rather than requiring a dedicated lab session.
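The data budget quoted above can be checked with a few lines of arithmetic. This is only a sanity check on the figures stated in this summary (64 rollouts as an upper bound, four seconds each); the variable names are illustrative.

```python
# Hardware data budget implied by the figures quoted above.
NUM_ROLLOUTS = 64        # upper bound on the number of rollouts
ROLLOUT_SECONDS = 4.0    # duration of each rollout in seconds

total_seconds = NUM_ROLLOUTS * ROLLOUT_SECONDS
total_minutes = total_seconds / 60.0

# The total stays under the five-minute figure from the paper's conclusions.
print(f"{total_seconds:.0f} s = {total_minutes:.2f} min")
```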
Residual Actuator Models Outperform Static Parameter Tuning for Complex Dynamics
The paper rigorously compares three simulator modification approaches — static friction/armature tuning (FricArm), action-delta networks (ActionDelta), and residual actuator models (ResidAct). For simple parameter shifts, all three work. For complex dynamics like a spring-loaded joint, static tuning fails catastrophically: "all Model Parameter Shift variants fail catastrophically, producing dramatically lower reward and velocity-tracking accuracy than the original policy. In fact, the CMA-ES optimization becomes unstable, continually increasing its exploration variance as it fails to discover any meaningful relationship between the parameters and the matching cost" (Section V-B, Spring Joint). The residual actuator model — a tiny per-joint neural network outputting torque corrections at 1 kHz — is the consistent winner.
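A minimal sketch of the residual actuator idea, under stated assumptions: each joint gets its own tiny network (hidden sizes [8, 4], as reported later in this summary) whose output is added to a nominal PD torque at every 1 kHz simulation step, while the policy target is held for 20 such steps (50 Hz). The input features (position error and joint velocity), the tanh activation, the PD gains, and the unit-inertia joint dynamics are all assumptions made for illustration, not details taken from the paper.

```python
import numpy as np

POLICY_HZ = 50                      # policy rate quoted in this summary
SIM_HZ = 1000                       # simulation rate quoted in this summary
DECIMATION = SIM_HZ // POLICY_HZ    # 20 simulation steps per policy action
NUM_JOINTS = 12                     # Go2 actuated joints

def init_residual_net(rng, in_dim=2, hidden=(8, 4)):
    """Small random weights for one per-joint MLP: in_dim -> 8 -> 4 -> 1."""
    sizes = (in_dim, *hidden, 1)
    return [(rng.normal(0.0, 0.1, size=(m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def residual_torque(params, features):
    """Forward pass of one per-joint network; returns a scalar torque correction."""
    h = features
    for i, (W, b) in enumerate(params):
        h = h @ W + b
        if i < len(params) - 1:     # assumed tanh on hidden layers only
            h = np.tanh(h)
    return float(h[0])

def actuator_torques(nets, q_target, q, qd, kp=25.0, kd=0.5):
    """Nominal PD torque plus the learned per-joint residual (gains are assumed)."""
    tau_pd = kp * (q_target - q) - kd * qd
    feats = np.stack([q_target - q, qd], axis=1)          # shape (12, 2)
    tau_res = np.array([residual_torque(net, f) for net, f in zip(nets, feats)])
    return tau_pd + tau_res

rng = np.random.default_rng(0)
nets = [init_residual_net(rng) for _ in range(NUM_JOINTS)]
q = np.zeros(NUM_JOINTS)
qd = np.zeros(NUM_JOINTS)
q_target = np.full(NUM_JOINTS, 0.1)   # one policy action, held for 20 sim steps
dt = 1.0 / SIM_HZ

for _ in range(DECIMATION):
    tau = actuator_torques(nets, q_target, q, qd)   # residual evaluated every sim step
    qd = qd + dt * tau                              # toy unit-inertia joint dynamics
    q = q + dt * qd
```

In a calibration pipeline the network weights would be the parameters searched over by the optimizer; here they are random so the loop can run on its own.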
Wasserstein Distance Outperforms Naive L2 Matching Under Real-World Noise
The paper's most practically significant finding is the noise robustness comparison. Under realistic hardware noise conditions (σ = 5.0 N·m torque perturbation plus timing jitter), the naive L2 observation-matching baseline (MatchO) produces 86.3% average parameter error. The Wasserstein-based metric produces 18.7% error under identical conditions (Table II). Even the privileged baseline that requires mocap-quality base state (MatchS) degrades to 37.8% error at this noise level — worse than Wasserstein. As the paper puts it: "Wass is competitive at the smallest noise level and significantly better at the two larger noise levels" (Section V-B).
The Framework Generalizes Beyond Its Training Distribution
In the Spring Joint sim-to-real experiment, the finetuned policy successfully generalized to high-speed sideways motion that was never observed during data collection: "Although the robot could not successfully execute high-speed sideways motion (e.g. +1m/s) during data collection, lower-speed trajectories were sufficient for the sim-to-real pipeline to generalize across the entire command space" (Section V-C). This is a meaningful result for deployment teams who cannot safely collect data at the boundaries of the operational envelope.
2. Contrarian Perspectives
Dynamics Randomization Is Not a Substitute for Closing the Sim-to-Real Gap — It's a Workaround That Has Structural Limits
The robotics industry has broadly converged on domain randomization (DR) as the standard tool for sim-to-real robustness. This paper argues that DR is fundamentally misaligned with the goal: "DR does not aim to close the sim-to-real gap, but to make policies robust to it. As tasks grow more complex and gaps widen, DR increasingly shows practical limits" (Section I). The paper further notes that wider randomization ranges actively degrade policy quality: "wider randomization ranges increase task difficulty and prolong training, while excessively large ranges can lead to overly conservative behaviors" (Section I, citing He et al. 2024). Most companies building legged locomotion systems have doubled down on DR. This paper argues they're treating a symptom rather than the disease, and that the cost of that approach compounds as task complexity grows.
Motion Capture Is Not Required for Rigorous Simulator Identification — And Its Presence May Actually Hurt at High Noise Levels
The conventional assumption in high-performance robotics deployment is that accurate system identification requires precise ground-truth state measurement, ideally from a motion capture system. The paper's noisy-conditions experiments directly undermine this assumption. At σ = 5.0 N·m noise, the mocap-dependent MatchS baseline achieves 37.8% average parameter error, while the mocap-free Wasserstein method achieves 18.7% (Table II). The paper explains why: "Time-aligned sim-to-real matching for locomotion... small discrepancies in simulator parameters or initial states cause rollouts to diverge rapidly from hardware trajectories, making pointwise comparisons unreliable" (Section IV). For sim-to-real tuning of legged systems, the infrastructure investment in motion capture may be not only unnecessary but actively counterproductive under realistic noise conditions.
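The failure of pointwise comparison can be illustrated with a toy example. The sketch below is not the paper's experiment: it uses a synthetic sinusoidal "joint trajectory" and compares a time-aligned L2 cost against a sorted-sample 1D Wasserstein distance for two simulator candidates, one with the correct dynamics but a timing offset and one that is perfectly time-aligned but has the wrong amplitude. All signals and names are assumptions chosen for illustration.

```python
import numpy as np

def l2_time_aligned(a, b):
    """Mean squared error between samples compared at the same time index."""
    return float(np.mean((a - b) ** 2))

def wasserstein_1d(a, b):
    """1D Wasserstein-1 distance between equal-size samples, computed by sorting."""
    return float(np.mean(np.abs(np.sort(a) - np.sort(b))))

t = np.linspace(0.0, 4.0, 4000, endpoint=False)      # a 4 s rollout sampled at 1 kHz
freq_hz = 2.0
real = np.sin(2 * np.pi * freq_hz * t)

# Candidate A: identical dynamics, but the rollout starts a quarter cycle out of phase.
sim_shifted = np.sin(2 * np.pi * freq_hz * t + np.pi / 2)
# Candidate B: perfectly time-aligned, but the amplitude is wrong.
sim_wrong = 0.5 * np.sin(2 * np.pi * freq_hz * t)

l2_shifted = l2_time_aligned(real, sim_shifted)
l2_wrong = l2_time_aligned(real, sim_wrong)
w_shifted = wasserstein_1d(real, sim_shifted)
w_wrong = wasserstein_1d(real, sim_wrong)

print("L2:          shifted", l2_shifted, "wrong amplitude", l2_wrong)
print("Wasserstein: shifted", w_shifted, "wrong amplitude", w_wrong)
```

In this toy setting the time-aligned L2 cost prefers the wrong-amplitude candidate, while the distributional cost prefers the candidate with the correct dynamics, because sorting discards the timing offset.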
Short-Horizon State Matching (the Dominant Paradigm in Recent Agile Locomotion Work) Can Make Things Worse, Not Better
Several prominent recent papers on agile locomotion use short-horizon state matching as their sim-to-real cost; the one explicitly cited here is He et al. 2025's ASAP framework. This paper shows that MatchS(1), which resets the simulator at every single step, produces worse finetuned policies than no adaptation at all when applied to expressive neural modification models: "MatchS(1) cost performs particularly poorly. A horizon of one step captures too little dynamical information for high-dimensional neural modification models... This actually causes a larger sim-to-sim gap than was originally present, and consequently leads to finetuned policies that perform worse than the original π₀" (Section V-B). Teams building on or inspired by ASAP-style pipelines should treat this as a direct warning about their cost function choice.
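A rough sketch of what a MatchS(H)-style cost looks like, assuming it means "reset the simulator to the recorded hardware state every H steps, replay the recorded actions, and accumulate the squared state error." The scalar linear system, its coefficients, and the function names are hypothetical stand-ins; the paper's actual cost operates on full robot state.

```python
import numpy as np

def match_s_cost(real_states, actions, sim_step, horizon):
    """Mean squared state error with a privileged reset every `horizon` steps.

    real_states has len(actions) + 1 entries (the state before and after each action).
    """
    num_steps = len(actions)
    sq_errors = []
    for start in range(0, num_steps, horizon):
        x = real_states[start]                  # reset to the recorded hardware state
        for t in range(start, min(start + horizon, num_steps)):
            x = sim_step(x, actions[t])         # open-loop replay of recorded actions
            sq_errors.append((x - real_states[t + 1]) ** 2)
    return float(np.mean(sq_errors))

def make_step(a, b=0.1):
    """Toy scalar dynamics x' = a*x + b*u (hypothetical stand-in for a simulator)."""
    return lambda x, u: a * x + b * u

rng = np.random.default_rng(0)
actions = rng.uniform(0.0, 1.0, size=400)

real_step = make_step(a=0.95)                   # "hardware" dynamics
real_states = [0.0]
for u in actions:
    real_states.append(real_step(real_states[-1], u))
real_states = np.array(real_states)

wrong_sim = make_step(a=0.90)                   # simulator with a mismatched parameter
cost_h1 = match_s_cost(real_states, actions, wrong_sim, horizon=1)
cost_h20 = match_s_cost(real_states, actions, wrong_sim, horizon=20)
cost_true = match_s_cost(real_states, actions, real_step, horizon=20)

print("wrong parameter, H=1: ", cost_h1)
print("wrong parameter, H=20:", cost_h20)
print("true parameter,  H=20:", cost_true)
```

With a one-step horizon the same parameter error produces a much smaller cost, since the mismatch never gets a chance to compound. That weaker signal is one plausible reading of why a single-step horizon "captures too little dynamical information."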
3. Companies Identified
Unitree Robotics
- Description: Chinese quadruped and humanoid robot manufacturer
- Why relevant: The Go2 quadruped is the primary experimental platform for all hardware experiments in this paper. All real-world results — spring joint experiments, bipedal walking — are demonstrated on Go2 hardware. The IsaacLab Go2 USD model is used for training.
- Quote: "All scenarios use the Unitree Go2 quadruped robot. Training is performed in IsaacLab using PPO as the reinforcement learning algorithm with the official Go2 USD model" (Section V-A)
Agility Robotics
- Description: US humanoid robot manufacturer, maker of the Digit humanoid
- Why relevant: The paper attempted to apply this framework to Digit and documents a specific, important failure mode. The results reveal a class of sim-to-real gaps this method cannot yet address: when the initial policy fails outright on hardware, there is no valid data to match against.
- Quote: "We also attempted to apply our framework to a more challenging sim-to-real setting on the Agility Robotics Digit humanoid robot. However, we encountered several practical and conceptual difficulties... our baseline Digit locomotion policy failed outright for backwards walking" (Section VI)
NVIDIA (IsaacLab)
- Description: Developer of IsaacLab, a GPU-accelerated robotics simulation platform
- Why relevant: IsaacLab is the training simulator used throughout. The entire pipeline — RL training, CMA-ES optimization, finetuning — runs in IsaacLab. The framework is directly portable to any team using IsaacLab for legged robot training.
- Quote: "Training is performed in IsaacLab using PPO as the reinforcement learning algorithm with the official Go2 USD model provided in the repository" (Section V-A)
4. People Identified
Jeremy Dao
- Lab/Institution: Collaborative Robotics and Intelligent Systems Institute (CoRIS), Oregon State University
- Why notable: Lead author. His research sits at the intersection of RL-based locomotion and practical sim-to-real deployment. The bipedal walking result on a quadruped platform signals work on morphologically challenging behaviors that will matter for humanoid development.
- Quote: "Real-world experiments demonstrate substantial drift reduction using less than five minutes of hardware data, even for a challenging two-legged walking behavior" (Abstract)
Alan Fern
- Lab/Institution: CoRIS, Oregon State University
- Why notable: Senior author and established figure in robot learning. His prior work spans sample-efficient RL and human-robot interaction. This paper reflects a pragmatic, deployment-oriented research philosophy — moving away from methods that require expensive infrastructure.
- Quote: "We hope that this work will aid others in crossing the sim-to-real gap and enable more complex behaviors to be realized on physical hardware" (Section VII)
Relevant External Researchers (cited and directly compared against)
- Tairan He et al. (Carnegie Mellon / ASAP): Their 2025 ASAP paper on agile humanoid skills is the direct privileged-state-matching baseline this work challenges. The MatchS(20) variant corresponds to their approach.
- Nikita Rudin et al. (ETH Zurich, RSL): Walk-in-minutes paper; their actuator network concept is the architectural ancestor of the residual actuator model used here.
- Jonah Miller et al. (Boston Dynamics / ICRA 2025): Most closely related prior work, also using Wasserstein distance on Spot hardware. This paper argues their MMD component is unnecessary and their Wasserstein computation can be efficiently approximated. Quote: "We find that the MMD component is not necessary and that the Wasserstein Distance... can be approximated efficiently with the average of 1D Wasserstein distances across joint dimensions" (Section II-B)
5. Operating Insights
If You Are Tuning a Simulator Without Motion Capture, Stop Using L2 Trajectory Matching
Teams running sim-to-real pipelines with only onboard sensors are likely using some form of direct observation comparison (L2 error between joint trajectories). This paper shows that approach degrades rapidly under real hardware conditions. At moderate noise levels (σ = 2.5 N·m), MatchO achieves 40.7% average parameter error vs. 8.3% for the Wasserstein method (Table II). The implementation cost of switching is low — the 1D marginal Wasserstein approximation is computed in O(n log n) with sorted arrays — and the accuracy gain under real-world noise is large. Any team doing offline simulator calibration should treat this as a near-term engineering change, not a research project.
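A minimal sketch of the marginal approximation described above: compute a 1D Wasserstein distance per joint dimension and average the results. It assumes the hardware and simulation logs are arrays of shape (samples, dimensions) and compares empirical quantiles so the two logs may have different lengths; the quantile-grid size and function names are illustrative choices, not the paper's implementation.

```python
import numpy as np

def wasserstein_1d(x, y, num_quantiles=256):
    """Approximate 1D Wasserstein-1 distance by comparing empirical quantiles.

    np.quantile handles the ordering of the samples internally, so no time
    alignment between x and y is needed.
    """
    qs = (np.arange(num_quantiles) + 0.5) / num_quantiles
    return float(np.mean(np.abs(np.quantile(x, qs) - np.quantile(y, qs))))

def marginal_wasserstein(real, sim):
    """Average the per-dimension 1D distances. real: (N, D), sim: (M, D)."""
    if real.shape[1] != sim.shape[1]:
        raise ValueError("real and sim must have the same number of dimensions")
    per_dim = [wasserstein_1d(real[:, d], sim[:, d]) for d in range(real.shape[1])]
    return float(np.mean(per_dim))

# Synthetic stand-ins for logged joint signals (12 dimensions, unequal log lengths).
rng = np.random.default_rng(0)
real_log = rng.normal(loc=0.0, scale=1.0, size=(1000, 12))
sim_log = rng.normal(loc=0.2, scale=1.0, size=(800, 12))

cost = marginal_wasserstein(real_log, sim_log)
print("matching cost:", cost)
```

Because the cost depends only on each dimension's sample distribution, shuffling the rows of either log leaves it unchanged, which is what removes the need for synchronized initial conditions.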
For Humanoid Deployment, Failure Mode Coverage in Hardware Data Collection Is Now a First-Class Engineering Problem
The Digit experiment surfaces a critical gap in the current methodology that will affect every team deploying humanoids in the field: "This exposes an inherent limitation of our approach: it implicitly assumes that the initial sim-to-real transfer is sufficiently successful to generate informative hardware trajectories within the relevant region of the state space. When this assumption does not hold, the hardware data distribution fails to characterize the states in which the mismatch occurs" (Section VI). Furthermore, failure trajectories are actively harmful: "The system can trivially learn modification parameters that cause arbitrary failure in simulation, thereby matching the observed distribution without corresponding to any meaningful or physically plausible change in dynamics" (Section VI). For humanoid CTOs, this means the data collection protocol — what behaviors to run, in what order, with what safety constraints — is not an afterthought. It directly determines which sim-to-real gaps you can and cannot close.
6. Overlooked Insights
The Residual Actuator Model Runs at 1 kHz — This Is an Architectural Choice With Real Deployment Constraints
Most readers will focus on the matching accuracy results. But buried in Section IV-B is a detail with significant real-time compute implications: "Unlike the action–delta model, which operates at the policy rate, the residual actuator model is evaluated at the simulation rate and therefore offers greater expressivity" — and that simulation rate is 1 kHz (vs. the 50 Hz policy rate). The network itself is small (two layers, hidden sizes [8, 4], per-joint), but deploying 12 of these networks at 1 kHz on embedded hardware requires careful profiling. Teams considering this approach for deployment — not just simulation — need to validate that their onboard compute can sustain this inference rate before committing to the residual actuator architecture over the lower-frequency action-delta alternative.
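A back-of-the-envelope estimate of the inference load can help frame that profiling. The sketch below counts multiply-accumulate operations for one per-joint network with hidden sizes [8, 4] and scales by 12 joints at 1 kHz. The input dimension of 2 is an assumption (the summary does not state it), and the count ignores activation functions, memory traffic, and per-call framework overhead, which often dominate for networks this small.

```python
def mlp_macs(sizes):
    """Multiply-accumulates for one forward pass of a dense MLP."""
    return sum(m * n for m, n in zip(sizes[:-1], sizes[1:]))

def mlp_params(sizes):
    """Weights plus biases of a dense MLP."""
    return sum(m * n + n for m, n in zip(sizes[:-1], sizes[1:]))

LAYER_SIZES = (2, 8, 4, 1)   # assumed input dim 2; hidden sizes [8, 4]; scalar output
NUM_JOINTS = 12
SIM_HZ = 1000                # rate at which the residual actuator model is evaluated
POLICY_HZ = 50               # rate at which the action-delta model would be evaluated

macs_per_eval = mlp_macs(LAYER_SIZES)
params_per_net = mlp_params(LAYER_SIZES)
evals_per_second = NUM_JOINTS * SIM_HZ
macs_per_second = macs_per_eval * evals_per_second
rate_ratio = SIM_HZ // POLICY_HZ

print("parameters per network:", params_per_net)
print("forward passes per second:", evals_per_second)
print("multiply-accumulates per second:", macs_per_second)
print("evaluations per policy step:", rate_ratio)
```

The raw arithmetic is modest; what needs profiling is the latency and scheduling cost of issuing thousands of tiny forward passes per second, which this count does not capture.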
The Paper's Own Failure on Digit Defines the Boundary Condition for the Entire Class of Distribution-Matching Sim-to-Real Methods
The Digit failure is reported in a single section (VI) and framed as future work, but it is actually the most strategically important finding in the paper for anyone building or funding humanoid systems. The limitation is not specific to this method. It applies to any sim-to-real approach that relies on matching hardware and simulation distributions: if the hardware policy fails in a regime you care about, you cannot collect the distribution you need to match, and optimizing the matching cost may actively mislead your simulator toward representing failure modes. The paper states: "At present, our framework lacks a principled mechanism for incorporating such failure regimes into the data distribution or for bridging sim-to-real gaps in regions of state space that are not represented in hardware data" (Section VI). This is a known unknown that every investor and engineer evaluating sim-to-real pipelines for complex humanoid behaviors should carry forward: the method works when deployment is already mostly working, but it cannot bootstrap a failing policy.