AnoleVLA: Lightweight Vision-Language-Action Model with Deep State Space Models for Mobile Manipulation

DATE: April 20, 2026
SOURCE: ARXIV PHYSICAL AI RESEARCH
PARTICIPANTS: YUSUKE TAKAGI, KOMEI SUGIURA, ET AL. (ARXIV PHYSICAL AI)
ARXIV: 2603.15046

// KEY TAKEAWAYS (4 ITEMS)
  1. Replacing Transformer Backbones with State Space Models Breaks the Latency Wall
  2. Smaller and Faster Can Beat Larger and Slower
  3. Trajectory Smoothness Is a First-Class Engineering Requirement, Not an Afterthought
  4. Data Efficiency at 50 Episodes Per Task Is a Real-World Signal
// SUMMARY

Why Should You Care?

Most VLA research optimizes for benchmark performance on server-grade hardware. AnoleVLA is one of the first papers to directly target the constraint that actually kills real-world deployments: you can't put a 7B-parameter model on a mobile robot and expect it to run fast enough to be useful. The paper delivers a 0.47B model that beats a 3B model by 21 points in real-world task success and runs roughly 3x faster (216 vs. 578 ms/chunk). That's the kind of tradeoff that makes deployment teams pay attention.


1. Key Themes

Replacing Transformer Backbones with State Space Models Breaks the Latency Wall

The core architectural bet is swapping transformer self-attention (which scales quadratically with sequence length, O(L²)) for Mamba, a selective deep state space model that processes sequences in linear time O(L). This matters because VLAs concatenate vision tokens, language tokens, and robot-state tokens into a single long sequence — exactly the regime where transformer cost explodes.
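To make the scaling argument concrete, here is a minimal sketch of a plain (non-selective) linear state space recurrence next to a count of attention token pairs. It is not Mamba's selective scan and not the paper's implementation; the function names and the matrices `A`, `B`, `C` are illustrative assumptions. The point is only that the recurrence does one fixed-cost update per token, while self-attention compares every token with every other token.

```python
import numpy as np

def linear_ssm_scan(x, A, B, C):
    """Run h_t = A h_{t-1} + B x_t, y_t = C h_t over a sequence.

    x: (L, d_in), A: (n, n), B: (n, d_in), C: (d_out, n).
    Each token costs one fixed-size state update, so work grows as O(L).
    """
    h = np.zeros(A.shape[0])
    outputs = []
    for x_t in x:
        h = A @ h + B @ x_t      # update the hidden state
        outputs.append(C @ h)    # read out from the state
    return np.stack(outputs)

def attention_pair_count(seq_len):
    """Number of query-key pairs in full self-attention: O(L^2)."""
    return seq_len * seq_len

def ssm_step_count(seq_len):
    """Number of recurrence steps in the scan above: O(L)."""
    return seq_len
```

Doubling the sequence length doubles `ssm_step_count` but quadruples `attention_pair_count`, which is the regime the long vision + language + state sequences fall into.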

The practical result: AnoleVLA (0.47B parameters) achieves 216 ms/chunk inference on real hardware, versus 578 ms/chunk for π0.5 (3B parameters), which the paper describes as "approximately three times faster" (Section VI-B, Table II); the measured ratio is 578/216 ≈ 2.7. For a mobile manipulator doing closed-loop control, cutting inference latency from 578 ms to 216 ms is the difference between a robot that feels responsive and one that feels broken.
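The latency figures translate directly into how often a new action chunk can be produced. The short calculation below uses only the two Table II latencies quoted above; the helper name `chunks_per_second` is an illustrative assumption, and it ignores any overhead outside model inference.

```python
def chunks_per_second(latency_ms):
    """Upper bound on action chunks per second for a given inference latency."""
    return 1000.0 / latency_ms

anole_ms = 216.0   # AnoleVLA, ms per chunk (Table II as quoted above)
pi05_ms = 578.0    # pi0.5, ms per chunk (Table II as quoted above)

speedup = pi05_ms / anole_ms                  # ratio of the two latencies
anole_rate = chunks_per_second(anole_ms)      # chunks per second for AnoleVLA
pi05_rate = chunks_per_second(pi05_ms)        # chunks per second for pi0.5
```

Under these numbers the speedup is about 2.7x, and the replanning rate moves from under 2 chunks per second to over 4.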

Smaller and Faster Can Beat Larger and Slower — With the Right Architecture

This is the headline result. At 0.47B parameters, AnoleVLA achieves a 63% average success rate across five real-world manipulation tasks (move, pick, open, close, push), compared to 52% for SmolVLA, 42% for π0.5, 40% for VLA-Adapter, and 29% for TinyVLA (Section VI-B, Table II).

The paper attributes π0.5's poor real-world performance specifically to two deployment realities: "small training dataset of only 50 episodes per task made it difficult to fine-tune π0.5... massive architectures with 3B parameters generally suffer from low sample efficiency with such limited data" and "a substantial domain gap exists regarding visual observations... the representations acquired during pre-training did not transfer effectively to our experimental setting" (Section VI-B). This is a directly reproducible failure mode that any team deploying large pre-trained VLAs should expect.

Trajectory Smoothness Is a First-Class Engineering Requirement, Not an Afterthought

Most VLA papers optimize for task success rate and treat motion quality as secondary. AnoleVLA introduces a two-stage training strategy where the first stage minimizes L1 loss on predicted action velocities, and the second stage adds an "acceleration loss" supervising the temporal differences between consecutive actions (Section IV-C, Equations 5 and 6).
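A rough sketch of what such a two-term objective could look like is shown below. The paper's Equations 5 and 6 are not reproduced in this summary, so the exact form here is an assumption: an L1 term on the predicted action chunk, plus an L1 term on the first temporal differences of the chunk. The function names and the `accel_weight` parameter are hypothetical.

```python
import numpy as np

def l1_action_loss(pred, target):
    """Stage-one style term: mean absolute error on the action chunk.

    pred, target: arrays of shape (H, D) for an H-step chunk of D-dim actions.
    """
    return float(np.mean(np.abs(pred - target)))

def acceleration_loss(pred, target):
    """Assumed stage-two term: L1 error on differences between consecutive actions."""
    d_pred = np.diff(pred, axis=0)       # (H-1, D) predicted step-to-step change
    d_target = np.diff(target, axis=0)   # (H-1, D) demonstrated step-to-step change
    return float(np.mean(np.abs(d_pred - d_target)))

def stage_two_loss(pred, target, accel_weight=1.0):
    """Sum of both terms; accel_weight is a hypothetical balancing factor."""
    return l1_action_loss(pred, target) + accel_weight * acceleration_loss(pred, target)
```

Two predictions with the same per-step error can differ sharply under this objective: a constant offset has zero difference error, while a prediction that oscillates around the target is penalized by the second term.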

The ablation is direct: in simulation, removing the second training stage drops average success rate by 4.73 points (63.12% vs. 67.85%, Table III). The physical experiment makes the stakes concrete: on the "Open drawer" task, a contact-rich, spatially constrained operation, SmolVLA achieves 55% while AnoleVLA achieves 75%, a gap the paper attributes specifically to "the lack of temporal smoothness constraints for conventional VLAs, which often causes jerky movements" (Section VI-B). For anyone deploying manipulation in environments with fragile objects or human co-presence, jerky trajectories are a safety issue, not just a performance metric.

Data Efficiency at 50 Episodes Per Task Is a Real-World Signal

The physical experiment was trained on only 250 total demonstrations (50 per task), collected via teleoperation with a leader-follower system on Toyota's Human Support Robot (Section VI-A). The fact that AnoleVLA achieves 63% success under this constraint — while π0.5 with 3B parameters and massive pretraining achieves only 42% — is a strong signal about sample efficiency of SSM-based architectures relative to large transformer VLAs under distribution shift.


2. Contrarian Perspectives

More Pretraining Data and Larger Models Do Not Guarantee Better Real-World Performance

The dominant narrative in VLA research is "scale up pretraining, generalize everywhere." AnoleVLA's results challenge this directly. π0.5, trained on massive cross-embodiment datasets and carrying 3B parameters, achieves only 42% average success rate in real-world trials — 21 points below a 0.47B model trained on 250 domain-specific episodes.

The paper is explicit about why: "Although π0.5 is heavily pre-trained, massive architectures with 3B parameters generally suffer from low sample efficiency with such limited data, making effective adaptation challenging without overfitting" (Section VI-B). The implication is that a well-architected lightweight model, fine-tuned on modest domain-specific data, can outperform a foundation model that hasn't seen your specific robot, viewpoints, or task distribution. Most companies evaluating VLA vendors should be running this exact comparison on their own hardware before committing to large-model deployments.

Continuous Action Generation Without Discretization Outperforms Autoregressive Token Decoding for Manipulation

The conventional VLA approach, inherited from large language models, discretizes continuous robot actions into tokens and uses autoregressive decoding. OpenVLA and RT-2 both do this. AnoleVLA explicitly rejects this design: "unlike many existing VLAs that discretize continuous actions into tokens to use the language model's autoregressive decoding, AnoleVLA directly generates continuous actions from the final hidden state of the backbone" (Section I).

The results suggest this is the right call for manipulation. Discretization introduces quantization error into fundamentally continuous control signals, and autoregressive decoding adds latency with each token step. AnoleVLA generates an entire H-step action chunk in a single forward pass (Section IV-B, Equation 4). This is a meaningful design philosophy divergence from the LLM-derived VLA orthodoxy — and the benchmark numbers support it.
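The quantization point can be illustrated with a small round trip through uniform bins. This is a sketch under stated assumptions, not any specific model's tokenizer: the action range of [-1, 1], the 256-bin count, and both function names are illustrative choices.

```python
import numpy as np

def discretize(action, low=-1.0, high=1.0, n_bins=256):
    """Map continuous values in [low, high] to integer bin indices."""
    width = (high - low) / n_bins
    idx = np.floor((action - low) / width)
    return np.clip(idx, 0, n_bins - 1).astype(int)

def undiscretize(idx, low=-1.0, high=1.0, n_bins=256):
    """Map bin indices back to the centre of each bin."""
    width = (high - low) / n_bins
    return low + (idx + 0.5) * width

# Round-trip a dense sweep of the action range and measure the error.
actions = np.linspace(-1.0, 1.0, 10001)
round_trip_error = np.abs(undiscretize(discretize(actions)) - actions)
max_error = float(round_trip_error.max())   # bounded by half a bin width
```

With these assumed settings the worst-case error per action dimension is half a bin width, (2/256)/2 ≈ 0.0039. A continuous action head avoids this error source, and a single forward pass avoids the per-token decoding loop.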

Inference Speed Benchmarks Without Accuracy Are Misleading

VLA-Adapter achieves the fastest inference at 101 ms/chunk — faster than AnoleVLA's 216 ms/chunk. But VLA-Adapter achieves only 40% average real-world success rate versus AnoleVLA's 63% (Table II). The paper frames this explicitly: "VLA-Adapter achieved the lowest latency at 101 ms/chunk in our setup. However, it yielded substantially lower success rates, which indicates that AnoleVLA provided a better balance between the success rate and inference speed under limited compute budgets" (Section VI-B).

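For investors and CTOs evaluating robotics companies, speed claims without paired accuracy claims on real tasks are a red flag. Success rate and latency have to be read together, not in isolation. A single success-rate-per-millisecond ratio is not enough either: on the Table II numbers it would rank VLA-Adapter (40/101 ≈ 0.40) above AnoleVLA (63/216 ≈ 0.29), so fix a minimum acceptable success rate first and then compare latency.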
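One way to read success rate and latency jointly is a simple dominance check, sketched below on the three Table II entries for which this summary reports both numbers. The helper names are illustrative assumptions, and SmolVLA and TinyVLA are omitted because their latencies are not quoted here.

```python
# (average real-world success rate in %, inference latency in ms/chunk)
models = {
    "AnoleVLA": (63, 216),
    "pi0.5": (42, 578),
    "VLA-Adapter": (40, 101),
}

def dominates(a, b):
    """True if a is at least as good as b on both axes and strictly better on one."""
    (success_a, latency_a), (success_b, latency_b) = a, b
    no_worse = success_a >= success_b and latency_a <= latency_b
    strictly_better = success_a > success_b or latency_a < latency_b
    return no_worse and strictly_better

def pareto_front(entries):
    """Names of models that no other model dominates, sorted alphabetically."""
    return sorted(
        name for name, metrics in entries.items()
        if not any(dominates(other, metrics)
                   for other_name, other in entries.items() if other_name != name)
    )

def success_per_ms(name):
    """Naive single-number ratio, shown only to illustrate how it can mislead."""
    success, latency = models[name]
    return success / latency
```

On these numbers π0.5 is dominated by AnoleVLA (lower success and higher latency), while AnoleVLA and VLA-Adapter both sit on the frontier. The naive ratio favors VLA-Adapter, which is why a minimum success-rate requirement needs to be applied before comparing speed.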


3. Companies Identified

Physical Intelligence (π)

Leading VLA research lab, makers of π0 and π0.5. Their π0.5 model (3B parameters) is used as the primary large-scale baseline throughout the paper. The real-world result (42% average success rate at 578 ms/chunk, outperformed by a 0.47B model) raises direct questions about deployment readiness of large foundation VLAs outside their training distribution. "Compared with π0.5, AnoleVLA not only yielded superior task performance but also demonstrated an inference speed approximately three times faster" (Abstract). Relevant to investors evaluating Physical Intelligence's deployment timeline and the extent to which their models require significant domain-specific fine-tuning.

Toyota Motor Corporation

Developer of the Human Support Robot (HSR) used as the physical platform for all real-world experiments. "We used the Human Support Robot (HSR) developed by Toyota Motor Corporation. This mobile manipulator has been used as the standard platform for the RoboCup@Home competition since 2017" (Section VI-A). Toyota's active investment in domestic service robot platforms makes this a credible deployment context, not a lab curiosity. The HSR's 11-DoF configuration (3 mobile base, 6 manipulator, 2 head) represents realistic home/service robot complexity.

Hugging Face (SmolVLA)

SmolVLA (0.45B parameters) is the strongest small-scale baseline and the most direct competitive comparison. "AnoleVLA outperformed SmolVLA, the strongest baseline, by 10.52 points [simulation] and... outperformed SmolVLA, the strongest among the small-scale VLAs, by 11 points for the average success rate [real world]" (Sections V-B, VI-B). SmolVLA is positioned as the current benchmark for affordable, efficient VLAs; AnoleVLA directly contests that position.


4. People Identified

Komei Sugiura

Keio University, Japan. Principal investigator on the AnoleVLA project. Also cited as co-author on adjacent work in interactive robot replanning with multimodal LLMs (Reference [12]) and the RoboCup@Home competition (Reference [14]), suggesting a sustained research program in language-guided service robotics. Notable for grounding VLA research in domestic/service robot deployment contexts rather than industrial arm settings.

Yusuke Takagi

Keio University, Japan. Lead author. Contact: yusuke.10.06@keio.jp. Primary architect of the AnoleVLA design, training strategy, and experimental evaluation. Supported by JSPS Fellows Grant (JP23KJ1917), indicating dedicated doctoral-level funding for this research direction.

Albert Gu and Tri Dao (Referenced, not at Keio)

Carnegie Mellon University and Princeton University / Together AI, respectively. Creators of the Mamba architecture (Reference [8]) that forms AnoleVLA's backbone. Their work on "Linear-Time Sequence Modeling with Selective State Spaces" is the foundational enabling technology. Gu is also credited for S4, the predecessor structured state space model. Their work is rapidly becoming infrastructure-level for efficient sequence modeling in robotics.

Chelsea Finn (Referenced, not at Keio)

Stanford / Physical Intelligence. Co-author on π0.5 (Reference [1]), Meta-World (Reference [38]), and the VLA fine-tuning study (Reference [17]). Her lab's work is both the primary benchmark target and a key methodological reference point throughout the paper.


5. Operating Insights

Deploy SSM-Based Models When You Have Limited Onboard Compute and Small Domain Datasets

If your robot platform cannot run a 3B+ parameter model in real-time, or if you have fewer than a few hundred domain demonstrations per task, the evidence from this paper suggests that a well-designed 0.47B SSM-based VLA will outperform a large pre-trained transformer VLA. The operating threshold the paper demonstrates: 50 episodes per task, single RTX 4090 class GPU, 11-DoF mobile manipulator, five task categories. "These findings demonstrate that AnoleVLA is highly sample-efficient and capable of robust task acquisition even with severely limited demonstration data" (Section VI-B).

For teams building service robots, logistics robots, or any mobile manipulation system where onboard inference is required and large-scale pretraining data isn't available for your specific domain, this is a directly actionable architecture recommendation.

Add Trajectory Smoothness Supervision Explicitly — Don't Assume It Emerges from Imitation Learning

The acceleration loss (supervising temporal differences of predicted actions, not just the actions themselves) is a low-cost addition to any manipulation training pipeline. The 4.73-point success rate improvement in simulation (Table III) and the 20-point advantage on contact-rich "Open" tasks in physical experiments (Section VI-B) both point to the same mechanism: standard imitation learning produces velocity-accurate but acceleration-inconsistent trajectories, which fail on tasks requiring sustained, precise contact.

Any team training manipulation policies should evaluate whether their current loss function supervises trajectory smoothness, not just endpoint or per-step accuracy. This is particularly critical for tasks involving doors, drawers, latches, or any contact-rich interaction where jerky motion causes task failure or hardware stress.
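A quick diagnostic for this, independent of any particular training setup, is to measure how much a policy's commanded velocities change from step to step. The sketch below assumes the actions are velocity commands sampled at a fixed interval; the function name is illustrative, and the 0.1 s default simply matches the 10 Hz logging rate the paper reports for data collection.

```python
import numpy as np

def mean_abs_acceleration(actions, dt=0.1):
    """Average magnitude of step-to-step change in velocity commands.

    actions: array of shape (T, D), one D-dim velocity command per timestep.
    dt: seconds between commands (0.1 s corresponds to 10 Hz).
    """
    accel = np.diff(actions, axis=0) / dt   # finite-difference acceleration
    return float(np.mean(np.abs(accel)))

# A smooth ramp versus a command that flips between two values every step.
smooth = np.linspace(0.0, 1.0, 11)[:, None]
jerky = np.array([[0.0], [1.0]] * 5)

smooth_score = mean_abs_acceleration(smooth)
jerky_score = mean_abs_acceleration(jerky)
```

Comparing this score between policy rollouts and the teleoperated demonstrations gives a rough indication of whether the policy is noticeably jerkier than the data it was trained on.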


6. Overlooked Insights

The Failure Mode Taxonomy Has Direct Product Implications

The error analysis (Section V-E, Table IV) categorizes 20 failure cases: 50% are "position recognition errors" (arm moves to wrong location entirely), 30% are "grasp point prediction errors" (reaches target but grips wrong point), and 20% are "incomplete motion execution" (trajectory starts correctly but terminates early). The dominant failure, object localization, is not a model capacity problem but a spatial reasoning problem.

The paper notes: "Manual inspection of the failed episodes suggests that the model often generated grasping motions toward incorrect regions or empty space, which indicates inappropriate localization of the target object from visual observations. This may reflect a limitation in extracting precise spatial information from the input images" (Section V-E). This is a systemic failure mode shared across current VLA architectures, not unique to AnoleVLA. Teams deploying manipulation systems should build explicit failure detection around incorrect-position grasps (e.g., force-torque sensing, grasp success detection) rather than assuming model improvements alone will resolve it.
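As a sketch of what an explicit grasp check might look like, the snippet below flags a grasp as failed when the gripper closes almost fully or registers almost no force, which is what grasping empty space would look like. Everything here is hypothetical: the `GraspReading` fields, the function name, and both thresholds are illustrative assumptions that would need tuning per gripper and object set, and none of it comes from the paper.

```python
from dataclasses import dataclass

@dataclass
class GraspReading:
    gripper_width_m: float   # finger opening after the close command, in metres
    grip_force_n: float      # measured grip force, in newtons

def grasp_succeeded(reading, min_width_m=0.005, min_force_n=2.0):
    """Heuristic check: an object should hold the fingers apart and resist them.

    Both thresholds are hypothetical defaults, not values from the paper.
    """
    holds_open = reading.gripper_width_m >= min_width_m
    resists = reading.grip_force_n >= min_force_n
    return holds_open and resists
```

A controller could call this right after the close command and trigger a retry or a re-localization step when it returns `False`, rather than continuing the trajectory with an empty gripper.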

The Multi-View Camera Setup Is Underreported but Operationally Critical

The physical experiment uses three simultaneous camera viewpoints — external, head-mounted, and wrist-mounted — collected at 10 Hz with all joint angles (Section VI-A). This is a non-trivial data infrastructure requirement that most lab-scale manipulation papers omit. The Mamba backbone's linear complexity with sequence length is specifically advantageous here: fusing multi-view visual tokens into a single sequence would create prohibitive cost for transformer-based models, but AnoleVLA's O(L) scaling makes multi-camera fusion computationally tractable. Teams building mobile manipulation systems should note that multi-view inputs + SSM backbones may be a natural pairing that improves spatial coverage without the latency penalty that would make it impractical with transformer architectures. This connection is mentioned briefly ("AnoleVLA leverages Mamba's linear complexity to efficiently fuse multi-view features," Section VI-B) but never systematically studied — a gap worth investigating.
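The multi-view argument can be written as a back-of-the-envelope scaling relation. The symbols below are assumptions introduced for illustration: V is the number of camera views, N is the number of visual tokens per view, and language and robot-state tokens are ignored.

```latex
\[
L = V N, \qquad
C_{\text{attn}} \propto L^{2} = V^{2} N^{2}, \qquad
C_{\text{ssm}} \propto L = V N
\]
\[
\frac{C_{\text{attn}}(V = 3)}{C_{\text{attn}}(V = 1)} = 9, \qquad
\frac{C_{\text{ssm}}(V = 3)}{C_{\text{ssm}}(V = 1)} = 3
\]
```

Under this simplification, going from one camera to three multiplies the sequence-mixing cost by about 9 for full self-attention but only by about 3 for a linear-time backbone. Actual wall-clock ratios will differ, since the vision encoder and other fixed costs are not captured here.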