CoCo-InEKF: State… | arXiv Physical AI Research Summary

Paper: CoCo-InEKF: State Estimation with Learned Contact Covariances in Dynamic, Contact-Rich Scenarios

Authors: Baumgartner, Müller, Serifi, Grandia, Knoop, Gross, Bächer (Disney Research / ETH Zurich)

Why it matters in one sentence: This paper solves one of the most stubborn bottlenecks in deploying legged robots on real hardware (knowing where and how firmly your robot is touching the ground) without requiring labeled contact data, external sensors, or expensive compute.

1. Key Themes

Binary Contact Detection Is a Broken Foundation for Agile Robots

Every classical legged robot state estimator assumes a simple binary truth: each foot is either planted (stationary) or in the air. This is wrong in practice. Slipping, partial contacts, and toe/heel transitions are the norm during dynamic motion, and the binary model catastrophically fails under these conditions. The paper's core argument is that contact is a continuous, directional phenomenon — not a switch.

"Traditional approaches often rely on binary contact states that fail to capture the nuances of partial contact or directional slippage." (Abstract)

The evidence is stark: a standard InEKF with heuristic binary contact detection on dancing motions produces a linear velocity RMSE of 2.675 m/s, compared to 0.046 m/s for CoCo-InEKF — a 58× error reduction. On ground motions, heuristic contacts yield 4.448 m/s RMSE vs. 0.099 m/s for CoCo-InEKF (Table III, Table IV). These aren't benchmark edge cases; this is the difference between a robot that falls and one that dances.

A Lightweight Neural Net Can Replace Hard-Coded Contact Logic — With No Labels Required

The central innovation is a tiny neural network (~240K parameters) that predicts a 3×3 covariance matrix per contact point, encoding how "stationary" each point is, in each direction, at each moment. This network is trained end-to-end through a differentiable Kalman filter using only position/velocity ground truth from simulation. No one needs to label when feet are in contact.

"This approach eliminates the need for heuristic ground-truth contact labels required by previous methods." (Abstract)

"By applying backpropagation through time (BPTT), we can train the neural contact module end-to-end using simple state-error losses, avoiding the need for ground-truth contact labels." (Section I)

This is operationally significant: labeling contact states on real robots requires force-torque sensors or manual annotation — both expensive and brittle. CoCo-InEKF sidesteps this entirely.
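To make the idea of a per-contact 3×3 covariance concrete, here is a minimal sketch that maps six unconstrained numbers (standing in for raw network outputs) to a symmetric positive-definite matrix through a Cholesky factor. The parametrization, the function name `covariance_from_raw`, and the `min_std` floor are assumptions for illustration; this summary does not say how the paper's network guarantees a valid covariance.

```python
import math

def softplus(x: float) -> float:
    """Numerically stable smooth map from any real number to a positive one."""
    return math.log1p(math.exp(-abs(x))) + max(x, 0.0)

def covariance_from_raw(raw, min_std=1e-3):
    """Build a 3x3 symmetric positive-definite matrix from 6 raw values.

    Assumed (hypothetical) layout: raw[0:3] feed the diagonal of a
    lower-triangular factor L, raw[3:6] are its below-diagonal entries.
    Returns L @ L^T as nested lists.
    """
    d = [softplus(r) + min_std for r in raw[:3]]   # strictly positive diagonal
    l10, l20, l21 = raw[3:6]
    L = [[d[0], 0.0, 0.0],
         [l10, d[1], 0.0],
         [l20, l21, d[2]]]
    return [[sum(L[i][k] * L[j][k] for k in range(3)) for j in range(3)]
            for i in range(3)]

# A point that may be sliding along x but is well anchored in y and z:
# large variance on x, small variance on the other two axes.
sigma = covariance_from_raw([2.0, -4.0, -4.0, 0.0, 0.0, 0.0])
```

A binary contact flag corresponds to the two extremes of this representation (a very small isotropic covariance for "in contact", a very large one for "in the air"); a full matrix can express everything in between, including direction-dependent slip.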

Real-Time Performance on Consumer-Grade Onboard Hardware

The full state estimator (neural net + Kalman filter) runs in 0.42 ms on an Intel i7 quad-core at 1.7 GHz, comfortably inside the roughly 1.67 ms period of a 600 Hz real-time control loop. Competing transformer-based methods (SET) require 2–3 ms per update, which exceeds that period, so they cannot run in real time on this hardware.

"Models marked with ∗ cannot run in real time... CoCo-InEKF clearly outperforms the other methods... able to run within a 600 Hz onboard control loop." (Sections V-B1, VI)

This is the difference between a research demo and a deployable system.
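The real-time argument is simple arithmetic, sketched below using only the timing figures quoted in this summary. The helper name `fits_budget` is hypothetical, and the check is deliberately optimistic: it ignores everything else that has to run inside the same control tick.

```python
CONTROL_RATE_HZ = 600.0
budget_ms = 1000.0 / CONTROL_RATE_HZ   # time available per control tick

# Inference times quoted in this summary (milliseconds).
inference_ms = {"CoCo-InEKF (full estimator)": 0.42, "SET-small": 2.09}

def fits_budget(latency_ms: float, budget: float = budget_ms) -> bool:
    """True if one estimator update fits inside one control period."""
    return latency_ms < budget

for name, ms in inference_ms.items():
    share = 100.0 * ms / budget_ms
    print(f"{name}: {ms:.2f} ms, {share:.0f}% of the {budget_ms:.2f} ms period, "
          f"fits={fits_budget(ms)}")
```

At 600 Hz the estimator uses about a quarter of each period, leaving the rest for the policy and low-level control, while a 2.09 ms update overruns the period on its own.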

Sim-to-Real Transfer Works, and the Gap Is Small

CoCo-InEKF is trained entirely in simulation and deployed on a physical bipedal robot with no real-world fine-tuning. Real-world velocity errors are actually lower than simulation numbers, suggesting the sim training is conservative.

"These real-world results agree with our simulation results in Tab. IV and indicate a low sim-to-real gap of our approach." (Section V-D)

On 13 dynamic dance routines, including unseen pirouettes and moonwalks, CoCo-InEKF achieves a 95% success rate on training-set dances, 100% on unseen pirouettes, and 100% on moonwalks. It matches or outperforms motion-capture-based state estimation (92% / 90% / 100%), which relies on external ground-truth tracking hardware (Table XII).

Filter Consistency: The Underrated Metric That Makes or Breaks Downstream Control

Most papers report accuracy. This paper also reports consistency: whether the filter's internal uncertainty estimate matches its actual error. An overconfident filter reports less uncertainty than its errors warrant, so downstream consumers trust bad estimates; an underconfident filter reports more uncertainty than warranted, so good estimates get discounted. CoCo-InEKF achieves 52.1% NEES consistency on the combined core state, vs. 18–20% for all binary-contact baselines, and exceeds the 37.7% achieved by the reference filter that uses ground-truth contacts.

"CoCo-InEKF matches or exceeds the consistency of the original formulation utilizing ground-truth, privileged information." (Section V-C, Table X)

For robotics companies building downstream controllers that rely on uncertainty estimates (e.g., MPC, risk-aware planners), this is a direct enabler of safer, more capable control.
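A consistency figure of this kind can be computed along the following lines. This is a minimal sketch, not the paper's evaluation code: it assumes a 3-dimensional error (for example linear velocity) and a diagonal covariance to keep the arithmetic visible, and the function names are hypothetical. The bounds are the standard 2.5% and 97.5% chi-squared quantiles for 3 degrees of freedom.

```python
# Two-sided 95% chi-squared bounds for 3 degrees of freedom.
CHI2_3DOF_LOW, CHI2_3DOF_HIGH = 0.216, 9.348

def nees_diag(error, variances):
    """NEES e^T P^-1 e for a diagonal covariance P (a simplifying assumption)."""
    return sum(e * e / v for e, v in zip(error, variances))

def consistency_rate(errors, variances_seq):
    """Fraction of time steps whose NEES lies inside the 95% interval."""
    inside = 0
    for e, v in zip(errors, variances_seq):
        if CHI2_3DOF_LOW <= nees_diag(e, v) <= CHI2_3DOF_HIGH:
            inside += 1
    return inside / len(errors)

# Same 0.1 m/s error on each axis, three different claimed standard deviations.
errors = [[0.1, 0.1, 0.1]] * 3
claimed_var = [[0.01 ** 2] * 3,   # claims 0.01 m/s: overconfident, NEES = 300
               [0.1 ** 2] * 3,    # claims 0.1 m/s: consistent, NEES = 3
               [1.0 ** 2] * 3]    # claims 1.0 m/s: underconfident, NEES = 0.03
rate = consistency_rate(errors, claimed_var)   # one of three steps is inside
```

NEES above the upper bound means the errors are larger than the covariance predicts (overconfidence); NEES below the lower bound means the covariance is inflated (underconfidence).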

2. Contrarian Perspectives

Transformers Are Not the Answer for State Estimation at the Edge

The field has broadly embraced transformer architectures for robot learning, and the SET baseline (State Estimation Transformer, from Yu et al. 2024) represents this trend. CoCo-InEKF directly challenges the assumption that bigger, more expressive models win.

"The SET baselines are computationally most expensive, with the highest RMSE... CoCo-InEKF clearly outperforms the other methods." (Section V-B1, Table V)

SET-small (810K params, 2.09 ms inference) achieves 0.279 m/s RMSE on dancing; CoCo-InEKF (240K params, 0.42 ms total) achieves 0.046 m/s. On ground motions, they are comparable in accuracy — but SET cannot run in real time on the robot's onboard computer. The paper's argument is that structure matters more than scale: a physics-informed filter architecture with a small learned component outperforms a black-box transformer, especially at deployment constraints.

You Don't Need to Know Where the Contact Points Are

Conventional contact-aided state estimation requires careful, expert placement of contact frames on the robot model — typically at foot centers or specific joint locations. This paper shows that automated random sampling of contact candidates performs on par with or better than hand-engineered placements.

"The method is insensitive to their exact placement... It can be seen that there is little variation across the randomly initialized, automated selections, and that the performance is similar to or better than the handpicked baseline." (Sections III-C, V-B4, Table IX)

Automated selection (best-to-worst range) on dancing: [0.052–0.056] RMSE vs. handpicked 0.057. On ground motions: [0.092–0.104] vs. handpicked 0.099. The implication: porting CoCo-InEKF to a new robot morphology doesn't require contact geometry expertise. Run farthest-point sampling on the mesh, train, deploy. This substantially lowers the integration cost for teams working across multiple platforms.
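Farthest-point sampling itself is a short greedy procedure. The sketch below is a generic version operating on a list of 3D points; the random starting index, the function name, and the toy "foot sole" grid are illustrative assumptions, and the paper's actual selection pipeline may differ in its details.

```python
import math
import random

def farthest_point_sampling(points, k, seed=0):
    """Greedily pick k indices from `points` so the picks are spread far apart."""
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and len(points)")
    rng = random.Random(seed)
    chosen = [rng.randrange(len(points))]   # random start gives a random initialization
    # Distance from every point to its nearest already-chosen point.
    nearest = [math.dist(p, points[chosen[0]]) for p in points]
    while len(chosen) < k:
        nxt = max(range(len(points)), key=nearest.__getitem__)
        chosen.append(nxt)
        for i, p in enumerate(points):
            nearest[i] = min(nearest[i], math.dist(p, points[nxt]))
    return chosen

# Hypothetical stand-in for mesh vertices: a flat 5 x 3 grid on a foot sole.
sole_vertices = [(float(x), float(y), 0.0) for x in range(5) for y in range(3)]
contact_candidates = farthest_point_sampling(sole_vertices, k=4, seed=1)
```

Changing `seed` changes the starting vertex and therefore the selected set, which is the kind of variation across automated selections that the ablation above reports on.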

More History Is Not Better — Shorter Context Windows Win

A common intuition in sequence modeling is that longer context = better predictions. The ablation results invert this for state estimation.

"For all methods, H=150 causes performance degradation both in terms of inference time as well as estimation accuracy." (Section V-B1, Table VI)

CoCo-InEKF with H=20 achieves 0.046 RMSE; with H=150 it degrades to 0.052, while inference time more than quadruples (0.14 ms → 0.61 ms). One plausible explanation: the Kalman filter already carries temporal history in its state estimate and covariance, so adding more raw history to the neural component introduces noise rather than signal.

3. Companies Identified

Disney Research Description: Disney's robotics and physical simulation research arm, where the majority of authors are affiliated. Why relevant: This paper comes directly from Disney's robotics group, which has been quietly building serious bipedal robot capability (the Lima robot). Disney is not typically framed as a robotics company, but this work — alongside their VMP motion prior and stylized falling paper — suggests a coordinated push toward deployable, physically capable humanoid/bipedal platforms for entertainment and beyond.

"We evaluate models on Lima, a custom bipedal robot (0.84 m, 16.2 kg, 20 DoF) with an onboard computer (Intel i7, 4-Core, 1.7 GHz) running a 600 Hz control loop." (Section IV-A)

Reallusion Description: 3D animation and motion capture software company. Why relevant: The training dataset uses Reallusion's motion capture library (81 sequences of dance motion, retargeted to the Lima robot). This is a practical data pipeline insight: commercial mocap libraries can serve as robot training data for motion-referenced locomotion policies, reducing the need for custom robot-specific capture sessions.

"For the reference motions, we use a subset of the Reallusion dataset, with 81 sequences of 5.6–36.1 s duration, retargeted onto the Lima robot." (Section IV-B)

Intel Description: Semiconductor company; their i7 processor is the onboard compute platform for Lima. Why relevant: The paper's real-time performance claims are benchmarked on an Intel i7 quad-core at 1.7 GHz — a constrained, commercially available chip. This is the relevant compute tier for legged robot deployment today, and the 0.42 ms full-system inference on this chip is a meaningful deployment benchmark.

"An onboard computer (Intel i7, 4-Core, 1.7 GHz) running a 600 Hz control loop... computational timing benchmarks, the models are executed single-threaded on the robot's onboard computer as part of the real-time control loop." (Section IV-A)

Nvidia Description: GPU manufacturer. Why relevant: Training runs on a single RTX 4090 for up to 5 days with 1,280 parallel simulation environments. This is a reasonable training cost signal for teams evaluating whether to replicate this work — one consumer GPU, under a week.

"Models are trained on a single Nvidia RTX 4090 GPU for 100k iterations, or a maximum of 5 days, with E=1280 parallel environments." (Section IV-A)

4. People Identified

Michael Baumgartner Lab/Institution: ETH Zurich / Disney Research Why notable: Lead author; primary architect of the CoCo-InEKF system. Located at the intersection of academic rigor (ETH) and deployment-focused industrial research (Disney). Worth tracking for follow-on work; the paper itself flags real-world training data and greater training diversity as next steps.

"We are eager to explore whether incorporating real-world data or greater training diversity can further improve performance." (Section VI)

Ruben Grandia Lab/Institution: Disney Research Why notable: Co-author on CoCo-InEKF and the VMP (Versatile Motion Priors) paper cited as the locomotion policy backbone. Grandia is a key figure in Disney's legged robot stack — state estimation, motion priors, and control are all connected through his work. Previously at ETH Zurich's Robotic Systems Lab (RSL), a top-tier legged robotics group.

"We use a VMP policy that tracks arbitrary kinematic reference motions." (Section IV-B, citing Serifi et al. 2024, which includes Grandia as co-author)

Agon Serifi Lab/Institution: Disney Research Why notable: Co-author on CoCo-InEKF and primary author of the VMP motion prior framework that the dancing policy runs on. The VMP + CoCo-InEKF combination is essentially Disney's full locomotion stack for dynamic bipedal motion.

"We use a VMP policy [34] that tracks arbitrary kinematic reference motions." (Section IV-B)

Moritz Bächer Lab/Institution: Disney Research Why notable: Senior author and research lead. Bächer has been central to Disney Research's physical character animation and robotics work for over a decade. His group's output (this paper, VMP, the stylized falling paper) suggests a coherent multi-year program toward autonomous physical characters.

Listed as senior/corresponding author throughout.

Tzu-Yuan Lin & Maani Ghaffari (University of Michigan) Lab/Institution: University of Michigan Robotics Why notable: Authors of the "Hybrid Baseline" method (Lin et al. 2022) that CoCo-InEKF directly competes with and outperforms. Also authors of DRIFT, the modular InEKF implementation. Ghaffari's group is the primary academic lab working on InEKF-based legged robot state estimation — this paper is a direct successor to their line of work.

"Lin et al. proposed augmenting the state-of-the-art invariant extended Kalman filter (InEKF) with learned contact detection. However, this approach requires labeled contact data for training and still treats contact as a binary state." (Section I)

5. Operating Insights

Contact-Aware State Estimation Is a Prerequisite, Not an Afterthought, for Dynamic Robot Deployment

Teams building locomotion stacks often treat state estimation as solved ("we'll use an EKF with foot contacts") and focus engineering effort on control and planning. This paper demonstrates that for anything beyond quasi-static walking — including the stair climbing, getting-up, and manipulation-while-moving scenarios that actually matter commercially — binary contact estimation causes estimator divergence.

The practical test: if your robot's state estimator relies on foot velocity thresholds or height thresholds to determine contact, it is likely to break down under friction variation or ground impacts. The heuristic baseline in this paper reaches 4.448 m/s velocity RMSE on ground motions, which is effectively unusable. CTOs evaluating locomotion software stacks should ask: what is the contact detection strategy, and has it been tested under slippage and partial contact conditions?
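For reference, a threshold-based detector of the kind described above often looks like the sketch below. The function name and both threshold values are hypothetical placeholders, not taken from the paper.

```python
def heuristic_contact(foot_height_m, foot_speed_mps,
                      height_thresh_m=0.02, speed_thresh_mps=0.1):
    """Binary contact flag: the foot is low AND nearly stationary.

    Both thresholds are hypothetical placeholder values.
    """
    return foot_height_m < height_thresh_m and foot_speed_mps < speed_thresh_mps

planted = heuristic_contact(0.0, 0.02)    # True: treated as perfectly stationary
slipping = heuristic_contact(0.0, 0.30)   # False: contact information discarded
creeping = heuristic_contact(0.0, 0.09)   # True: 0.09 m/s of slip is silently ignored
```

The two failure modes are visible in the last two cases: a clearly slipping foot is dropped entirely even though it still constrains vertical motion, and a slowly sliding foot is fed to the filter as if it were fixed.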

"The InEKF approach with heuristic contact detection exhibits very high errors, indicating estimator divergence." (Section V-A2)

The BPTT Training Paradigm Enables Label-Free Adaptation to New Robot Platforms

The most operationally underappreciated aspect of this paper is its training methodology. Because the entire pipeline — contact covariance prediction → Kalman filter → state error — is differentiable, you can train the contact module using only ground-truth state trajectories from a simulator. No contact labeling. No force sensors. No manual threshold tuning per robot.

For teams deploying across multiple robot morphologies (a real challenge for humanoid and quadruped companies scaling to different hardware SKUs), this means the contact estimation module can be retrained for a new robot using only: (1) a simulator, (2) a control policy, and (3) state ground truth. The automated contact point selection procedure removes the last manual step.
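The label-free idea can be shown on a toy problem. The sketch below unrolls a scalar Kalman filter over a synthetic trajectory and tunes one parameter, the log of the measurement variance, using only the error against ground-truth state. Everything here is an illustrative assumption rather than the paper's implementation: the 1-D model, the synthetic data, the names, and in particular the central finite difference, which stands in for the gradient that backpropagation through time would supply in an autodiff framework.

```python
import math
import random

def run_filter(log_r, increments, measurements, x0=0.0, p0=1e-4, q=1e-4):
    """Unroll a scalar Kalman filter; exp(log_r) is the measurement variance."""
    r = math.exp(log_r)
    x, p, estimates = x0, p0, []
    for du, z in zip(increments, measurements):
        x, p = x + du, p + q                     # predict from the motion increment
        k = p / (p + r)                          # Kalman gain
        x, p = x + k * (z - x), (1.0 - k) * p    # correct with the measurement
        estimates.append(x)
    return estimates

def state_error_loss(log_r, increments, measurements, truth):
    """Mean squared state error against ground truth; no contact labels involved."""
    est = run_filter(log_r, increments, measurements)
    return sum((e - t) ** 2 for e, t in zip(est, truth)) / len(truth)

# Synthetic 1-D velocity trajectory (stands in for simulator ground truth).
rng = random.Random(0)
truth = [math.sin(0.05 * t) for t in range(301)]
increments = [truth[t] - truth[t - 1] + rng.gauss(0.0, 0.01) for t in range(1, 301)]
# A very noisy "stationary contact" measurement, e.g. from a slipping foot.
measurements = [truth[t] + rng.gauss(0.0, 1.0) for t in range(1, 301)]
targets = truth[1:]

log_r = math.log(0.01)          # start by trusting the measurement far too much
initial_loss = state_error_loss(log_r, increments, measurements, targets)
eps, lr = 1e-4, 5.0
for _ in range(50):
    # Central finite difference: a stand-in for the gradient BPTT would provide.
    grad = (state_error_loss(log_r + eps, increments, measurements, targets)
            - state_error_loss(log_r - eps, increments, measurements, targets)) / (2 * eps)
    log_r -= lr * grad
final_loss = state_error_loss(log_r, increments, measurements, targets)
print(f"loss {initial_loss:.4f} -> {final_loss:.4f}, "
      f"learned variance {math.exp(log_r):.3f}")
```

The filter is never told that the measurement is unreliable; the state-error loss alone pushes the variance upward. In the full method a network predicts a covariance per contact point and time step instead of a single scalar.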

"Our method addresses this by learning contact velocity covariances end-to-end via a differentiable formulation... we demonstrate that our method is insensitive to their exact placement." (Sections I, III-C)

6. Overlooked Insights

Filter Consistency as a Compounding Advantage for Closed-Loop Control

The NEES (Normalized Estimation Error Squared) results in Section V-C are buried in what looks like a statistical validation exercise, but they have direct operational consequences. A consistent filter is one whose internal uncertainty estimate accurately reflects actual error — which means downstream controllers (MPC, risk-aware planners, any system that ingests covariance estimates) can trust the filter's confidence signals.

All binary-contact baselines are severely overconfident: their NEES values far exceed the expected chi-squared bounds, meaning actual errors are much larger than the reported covariance implies (Table X: Hybrid Baseline achieves only 18.2% of time steps within the 95% confidence interval, vs. 52.1% for CoCo-InEKF). An overconfident filter tells the controller "I'm certain" when it shouldn't be, so the controller acts on estimates that are worse than advertised. CoCo-InEKF achieves better consistency than even the ground-truth contact baseline (52.1% vs. 37.7%), without optimizing for this metric at all.

"CoCo-InEKF matches or exceeds the consistency of the original formulation utilizing ground-truth, privileged information... our formulation thus improves the filter consistency, despite not explicitly optimizing for this metric." (Section V-C)

For companies building whole-body controllers or planning systems that propagate uncertainty, this is a silent performance multiplier that won't show up in accuracy benchmarks alone.

The 128-Step BPTT Horizon Is a Critical Hidden Hyperparameter

The BPTT unroll length ablation (Table VII) reveals a non-obvious training fragility: too short (L=64) and the model cannot learn long-horizon contact dynamics, yielding 0.066 RMSE; optimal (L=128) gives 0.046; too long (L=256) degrades to 0.051 RMSE while also dramatically reducing the number of training iterations achievable in the same wall-clock time (18,800 vs. 64,600 iterations). This means teams attempting to replicate or extend this work face a non-trivial compute budget allocation problem — longer horizons are more expensive per iteration and may hurt performance via gradient pathologies.

"Significant performance deterioration is seen for the shorter buffer size, presumably because the model cannot predict longer-horizon effects. Performance is also slightly worse for the longer buffer size, which could be explained by vanishing or exploding gradients, or due to higher training cost leading to fewer training iterations." (Section V-B2)

Anyone adapting this framework to a new platform should treat L as a first-order hyperparameter requiring explicit tuning, not a fixed design choice inherited from this paper.
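A small bookkeeping sketch of that sweep, using only the ablation numbers quoted in this summary. One assumption is flagged in the code: the summary gives the iteration counts as "18,800 vs. 64,600" without labels, and they are read here as belonging to L=256 and L=128 respectively.

```python
# Ablation values as quoted in this summary (Table VII).
rmse_by_horizon = {64: 0.066, 128: 0.046, 256: 0.051}
# Assumed reading: 64,600 iterations belong to L=128 and 18,800 to L=256.
iterations_by_horizon = {128: 64_600, 256: 18_800}

best_L = min(rmse_by_horizon, key=rmse_by_horizon.get)
for L, rmse in sorted(rmse_by_horizon.items()):
    penalty = 100.0 * (rmse / rmse_by_horizon[best_L] - 1.0)
    print(f"L={L:3d}: RMSE {rmse:.3f} ({penalty:+.0f}% vs. best)")

slowdown = iterations_by_horizon[128] / iterations_by_horizon[256]
print(f"Doubling L from 128 to 256 cut iterations by a factor of {slowdown:.1f}")
```

Under that reading, doubling the horizon costs more than a proportional share of iterations (a factor of about 3.4 rather than 2), which is why a sweep on a new platform should budget wall-clock time per horizon rather than a fixed iteration count.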