EmbodiedMidtrain: Bridging the Gap between Vision-Language Models and Vision-Language-Action Models via Mid-training
- 01 The VLM-to-VLA Transfer Gap Is Measurable and Larger Than Most Teams Assume
- 02 Proximity-Based Data Selection Outperforms Naive Mid-Training and All Hand-Crafted Alternatives
- 03 A 1.1B Mid-Trained Model Beats 7–8B Expert VLAs at a Fraction of the Training Cost
- 04 The Initialization Quality Determines the Trajectory of All Downstream Fine-Tuning
- 05 Spatial Reasoning Is the Key Differentiating Signal for Embodied Transfer
Why Should You Care?
Most robot AI systems today are built by taking a general-purpose vision-language model (VLM) off the shelf, such as Qwen, PaliGemma, or InternVL, and fine-tuning it on robot manipulation data. The paper's central argument is that this leaves performance on the table, because the VLM's internal representations were never shaped for embodied reasoning. EmbodiedMidtrain inserts a lightweight "alignment" step between VLM pretraining and vision-language-action (VLA) fine-tuning. The step costs a fraction of normal training compute, yet the resulting 1.1B model is competitive with, and on some benchmarks better than, models 3–8x larger.
1. Key Themes
The VLM-to-VLA Transfer Gap Is Measurable and Larger Than Most Teams Assume
The paper quantifies something practitioners often hand-wave: how far standard VLM training data sits from robot manipulation data. Using Maximum Mean Discrepancy (MMD) across feature spaces, the authors show that "MMD distances are generally smaller within the VLM group and within the VLA group than across the two groups, quantitatively confirming a clear distributional mismatch" (Section 3). The t-SNE visualization makes this visceral: VLA data clusters in tight, isolated islands while VLM data occupies sprawling, diffuse regions. This is not a minor statistical gap but a representation mismatch that every VLA team implicitly fights each time it fine-tunes.
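The within-group versus across-group comparison can be illustrated with a small sketch of a biased squared-MMD estimate. Everything specific here is an assumption made for illustration: the RBF kernel, the bandwidth `gamma`, and the toy 2-D "features" standing in for real VLM embeddings. The paper's actual feature extractor and kernel settings are not given in this summary. The sketch uses only the standard library so it runs anywhere; a real pipeline would use an array library.

```python
import math
import random

def rbf(x, y, gamma):
    """RBF kernel k(x, y) = exp(-gamma * ||x - y||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def mean_kernel(xs, ys, gamma):
    """Average kernel value over all pairs (x, y)."""
    return sum(rbf(x, y, gamma) for x in xs for y in ys) / (len(xs) * len(ys))

def mmd2(xs, ys, gamma=0.5):
    """Biased estimate of squared MMD between two samples."""
    return (mean_kernel(xs, xs, gamma)
            + mean_kernel(ys, ys, gamma)
            - 2.0 * mean_kernel(xs, ys, gamma))

def toy_features(rng, center, spread, n=60):
    """Hypothetical stand-in for frozen-VLM features of one dataset."""
    return [[rng.gauss(c, spread) for c in center] for _ in range(n)]

rng = random.Random(0)
vlm_a = toy_features(rng, (0.0, 0.0), 1.0)   # diffuse "VLM-like" datasets
vlm_b = toy_features(rng, (0.3, -0.2), 1.0)
vla_a = toy_features(rng, (4.0, 4.0), 0.3)   # tight "VLA-like" datasets
vla_b = toy_features(rng, (4.2, 3.9), 0.3)

within_vlm = mmd2(vlm_a, vlm_b)
within_vla = mmd2(vla_a, vla_b)
across = mmd2(vlm_a, vla_a)
print(f"within VLM: {within_vlm:.3f}")
print(f"within VLA: {within_vla:.3f}")
print(f"across:     {across:.3f}")
```

On this toy data the across-group value is much larger than either within-group value, which is the qualitative pattern the quoted passage describes.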
Proximity-Based Data Selection Outperforms Naive Mid-Training and All Hand-Crafted Alternatives
The core technical contribution is a lightweight binary classifier, trained on frozen VLM features, that learns to score VLM samples by how "robot-like" they are. The key ablation result: random sampling from the VLM pool (that is, mid-training without selection) reaches a Calvin average of 3.398, while the learned proximity estimator reaches 3.714 on the same benchmark (Table 2). Three alternative hand-crafted scoring methods (feature-space distance, VLA-conditioned perplexity, and delta perplexity) all underperform the learned estimator. The paper's conclusion: "the gains do not come merely from additional mid-training, but from identifying and retaining the subset of VLM data that is better aligned with the VLA domain" (Section 6.1). This is a critical engineering insight: more compute on the wrong data doesn't help.
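A minimal sketch of the idea, not the authors' implementation: fit a linear probe (logistic regression) that separates VLA features from VLM features, then rank the VLM pool by the probe's output and keep the top-scoring samples. The 2-D toy features, the "spatial" and "text" tags, the learning rate, and the selection size of 40 are all hypothetical. The 100 training steps simply echo the order of magnitude the summary mentions.

```python
import math
import random

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def logit(x, w, b):
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def train_linear_probe(features, labels, steps=100, lr=0.5):
    """Full-batch logistic regression on fixed (frozen) feature vectors."""
    dim, n = len(features[0]), len(features)
    w, b = [0.0] * dim, 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, labels):
            err = sigmoid(logit(x, w, b)) - y
            grad_w = [g + err * xi for g, xi in zip(grad_w, x)]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def cloud(rng, center, n, spread=0.5):
    """Hypothetical stand-in for frozen-VLM features."""
    return [[rng.gauss(c, spread) for c in center] for _ in range(n)]

rng = random.Random(0)
vla = cloud(rng, (2.0, 2.0), 100)                        # label 1: VLA data
pool = [(x, "spatial") for x in cloud(rng, (1.0, 1.0), 50)]
pool += [(x, "text") for x in cloud(rng, (-2.0, -2.0), 150)]

features = vla + [x for x, _ in pool]                    # label 0: VLM pool
labels = [1.0] * len(vla) + [0.0] * len(pool)
w, b = train_linear_probe(features, labels)

# Proximity score = probe's estimated probability that a sample is VLA-like.
ranked = sorted(pool, key=lambda item: sigmoid(logit(item[0], w, b)), reverse=True)
kept = ranked[:40]
spatial_kept = sum(1 for _, tag in kept if tag == "spatial")
print(f"{spatial_kept} of {len(kept)} selected samples are spatial-like")
```

Even though every pool sample carries the same negative label during training, the probe still ranks the samples nearest the VLA cluster highest, which is what makes it usable as a selection score.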
A 1.1B Mid-Trained Model Beats 7–8B Expert VLAs at a Fraction of the Training Cost
The headline result: EmbodiedMidtrain applied to InternVL3.5-1B (1.1B parameters, 1.0M/4.1M/4.1M training samples) scores 3.714 average task length on Calvin ABC-D, surpassing both OpenVLA at 7.7B (2.548) and π₀ at 3.1B (3.509), while using training budgets roughly 6-25x smaller (Table 1). On SimplerEnv-Bridge, the mid-trained 1.1B model ties Qwen3VL-4B (56.3% vs 56.3%) despite being 4x smaller. The paper frames this precisely: "what matters most is not the volume of pretraining data a VLM has seen, but how well the mid-training data aligns with the downstream embodied distribution" (Section 5.3).
The Initialization Quality Determines the Trajectory of All Downstream Fine-Tuning
A subtle but strategically important finding: the mid-trained model doesn't just start higher — the gap widens over training. "The mid-trained model already achieves higher performance in the early stages of fine-tuning... the gap widens rather than narrowing over time" (Section 6.2). Crucially, training loss curves look nearly identical between the two initializations, meaning loss is not a reliable proxy for the quality of the initialization. Teams monitoring only training loss during VLA fine-tuning are flying blind with respect to initialization quality.
Spatial Reasoning Is the Key Differentiating Signal for Embodied Transfer
The proximity estimator learns, without any manual labeling, to prefer spatially grounded data over text-centric data. RefSpatial achieves the highest average proximity scores; VCR (Visual Commonsense Reasoning) receives the lowest and is nearly absent from the final selected mixture (0.0%) (Appendix A.6, Figure 4). A high-scoring sample asks about spatial coordinates of objects in a scene; a low-scoring sample asks who wrote a book based on its cover. This validates a hypothesis many robotics ML teams hold intuitively but rarely act on systematically: spatial reasoning data is categorically more valuable for robot foundation models than general visual QA.
2. Contrarian Perspectives
Bigger VLM Backbones Are Not the Answer — Better Initialization Is
Conventional wisdom in the VLA space has been to chase larger backbone models. The results here challenge that. Qwen3VL-30B-A3B (30 billion parameters) scores 4.075 on Calvin and 44.8% on SimplerEnv. The mid-trained InternVL3.5-1B (1.1 billion parameters) scores 3.714 and 56.3% respectively. The 30B model still leads on Calvin, but the 1.1B model beats it on SimplerEnv with roughly the same fine-tuning budget and a backbone 27x smaller (Table 1). The paper's argument is structural: "most VLAs initialize from general-purpose off-the-shelf VLMs that are not tailored toward embodied action generation" (Section 1). Scaling an already-misaligned backbone may be a locally optimal but globally inefficient strategy.
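The comparison is simple arithmetic, and spelling it out shows where each model actually leads. The figures below are the ones quoted in this section; the dictionary layout and key names are just an illustrative choice.

```python
# Figures as quoted above (Table 1 of the paper, per this summary).
models = {
    "Qwen3VL-30B-A3B": {"params_b": 30.0, "calvin": 4.075, "simplerenv": 44.8},
    "InternVL3.5-1B (mid-trained)": {"params_b": 1.1, "calvin": 3.714, "simplerenv": 56.3},
}

big = models["Qwen3VL-30B-A3B"]
small = models["InternVL3.5-1B (mid-trained)"]

size_ratio = big["params_b"] / small["params_b"]
calvin_gap = small["calvin"] - big["calvin"]              # negative: 30B leads
simplerenv_gap = small["simplerenv"] - big["simplerenv"]  # positive: 1.1B leads

print(f"backbone size ratio: {size_ratio:.1f}x")
print(f"Calvin gap (small - big): {calvin_gap:+.3f}")
print(f"SimplerEnv gap (small - big): {simplerenv_gap:+.1f} points")
```

The size ratio comes out at about 27x, with the smaller model 0.361 behind on Calvin and 11.5 points ahead on SimplerEnv.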
Fine-Tuning VLMs on Embodied Tasks Does Not Reliably Improve VLA Performance
This challenges a large body of work on "embodied VLMs." The paper cites Zhang et al. (2026) directly: "simply fine-tuning VLM on curated embodied data does not reliably translate into better VLA performance" (Section 1). The community has invested heavily in embodied VQA datasets, spatial benchmarks, and robot trajectory VQA conversions — but the paper's framework suggests these may improve VLM-side benchmarks without closing the distributional gap that actually matters for action generation. The right question is not "does the VLM score better on spatial benchmarks?" but "does the VLM's representation space overlap better with VLA training data?" These are measurably different questions.
Dataset-Level Curation Is Insufficient — You Need Sample-Level Selection
Many teams making data decisions operate at the dataset level: "we'll include RefSpatial and exclude VCR." The paper shows this is a coarse approximation. "Even within the same source individual samples vary substantially in their relevance to VLA tasks... the estimator also performs fine-grained sample-level selection, retaining only the most VLA-aligned samples even from high-scoring datasets" (Section 4). The most striking evidence: LAION-400M, an overwhelmingly general image-caption dataset, contributes 32% of the final selected mid-training mixture (Appendix A.6). That is not because it is uniformly good, but because sample-level selection mines the useful fraction from it. A dataset-level filter that dropped general caption data would have discarded this contribution entirely.
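The difference between the two granularities is easy to see on a toy example. The source names and proximity scores below are invented for illustration, and the two selection functions are a sketch of the contrast rather than the paper's procedure.

```python
from collections import Counter, defaultdict

# Hypothetical (source, proximity score) pairs; not data from the paper.
scored = [
    ("spatial_qa", 0.91), ("spatial_qa", 0.84), ("spatial_qa", 0.35),
    ("captions", 0.88), ("captions", 0.79), ("captions", 0.10),
    ("captions", 0.08), ("captions", 0.05), ("captions", 0.12),
    ("commonsense_qa", 0.07), ("commonsense_qa", 0.04),
]

def dataset_level(samples, min_mean):
    """Keep every sample from sources whose mean score clears the bar."""
    by_source = defaultdict(list)
    for source, score in samples:
        by_source[source].append(score)
    keep = {s for s, v in by_source.items() if sum(v) / len(v) >= min_mean}
    return [(s, p) for s, p in samples if s in keep]

def sample_level(samples, k):
    """Keep the k highest-scoring samples regardless of source."""
    return sorted(samples, key=lambda item: item[1], reverse=True)[:k]

by_dataset = dataset_level(scored, min_mean=0.5)
by_sample = sample_level(scored, k=4)

print("dataset-level:", Counter(s for s, _ in by_dataset))
print("sample-level: ", Counter(s for s, _ in by_sample))
```

Dataset-level filtering keeps all of `spatial_qa`, including its weak 0.35 sample, and drops `captions` entirely because its mean is low. Sample-level selection instead keeps two samples from each, recovering the high-scoring captions that the coarser filter throws away.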
3. Companies Identified
Physical Intelligence (π₀, π₀.5)
Leading VLA company using PaliGemma as its VLM backbone. Its π₀ model (3.1B parameters) is used as an expert VLA baseline and scores 3.509 on Calvin ABC-D, below the mid-trained 1.1B InternVL model (3.714). This is strategically relevant: a PaliGemma-based stack may benefit from the mid-training paradigm proposed here. Referenced extensively as a state-of-the-art deployment-grade system. "π₀ is based on Paligemma-1 VLM and models continuous robot actions using flow matching" (Section 5.2).
NVIDIA (GR00T N1)
Its humanoid foundation model uses Eagle-2 as the VLM backbone. The paper identifies GR00T N1 as representative of the pattern it addresses: "a common thread is that the VLM backbone is taken off-the-shelf from general-purpose training without specific preparation for the embodied domain" (Section 2). GR00T N1's architecture, a VLM backbone with a dedicated cross-attention action decoder, is directly implicated by the paper's core finding.
Google DeepMind (Gemini Robotics / GR Team)
Referenced as part of the frontier VLA landscape. Gemini Robotics 1.5 is cited as an example of recent advances in robotics foundation models (Section 1). Its approach represents the large-scale, high-resource end of the VLA spectrum against which this paper positions its efficient alternative.
Bosch Research North America / BCAI
Two of the paper's five authors (Xin Ye, Liu Ren) are from the Bosch Center for Artificial Intelligence. This signals Bosch's active investment in VLA research infrastructure, which is relevant for industrial robotics deployment contexts. Their institutional involvement suggests this line of research has industrial application targets, not just academic ones.
Alibaba / Qwen Team
The Qwen2.5VL and Qwen3VL families serve as the primary off-the-shelf VLM baselines. Results show Qwen3VL-2B (2.1B parameters) scoring 4.142 on Calvin without mid-training, one of the stronger baselines in the study. The paper demonstrates that this model also benefits from EmbodiedMidtrain: "applying the mixture selected with InternVL3.5-1B to Qwen3VL-2B yields consistent gains" (Section 1). Qwen models are implicitly positioned as the current high-water mark for off-the-shelf VLMs in VLA settings.
Stanford / OpenVLA Team
OpenVLA (7.7B, Llama-2 backbone) serves as an expert VLA baseline and is notably the worst-performing model on Calvin (2.548 avg. length) and SimplerEnv (0% success rate). "OpenVLA is built on Llama-2-7B with DINOv2 and SigLIP visual encoders, and models robot control by autoregressively predicting discretized action tokens" (Section 5.2). Its underperformance relative to smaller models reinforces the paper's core thesis about alignment mattering more than scale.
4. People Identified
Yiyang Du
Language Technologies Institute, Carnegie Mellon University. Lead author. Also co-author of EmbSpatial-Bench (2024), a spatial understanding benchmark for embodied tasks that is used as part of the VLM candidate data pool in this work. Represents a researcher who has been building the data infrastructure (benchmarks, datasets) and is now applying it to model training pipelines. Contact: yiyangd@cs.cmu.edu.
Zhanqiu Guo
Language Technologies Institute, Carnegie Mellon University. Co-author on the core methodology. Contact: zhanqiug@cs.cmu.edu.
Chenyan Xiong
Language Technologies Institute, Carnegie Mellon University. Senior author. Known for work on information retrieval and representation learning. His involvement signals that the proximity estimation methodology draws on well-established principles from information retrieval and data selection for LLM pretraining: the paper explicitly cites the DSIR framework (Xie et al., 2023) from Percy Liang's group at Stanford as methodological inspiration. Contact: cx@cs.cmu.edu.
Xin Ye and Liu Ren
Bosch Research North America and BCAI. Industry co-authors bridging this research toward industrial robotics application. Liu Ren's presence as a senior Bosch researcher suggests this work has a deployment pathway in manufacturing or service robotics contexts.
Jianke Zhang et al. (VLM4VLA team)
Not authors of this paper but critically important: their VLM4VLA paper (Zhang et al., 2026) provides the VLA fine-tuning framework, architecture, and baseline results that EmbodiedMidtrain builds on. The paper states: "we initialize a VLA from the resulting VLMs and fine-tune it following the VLA training pipeline of VLM4VLA" (Section 5.1). Understanding EmbodiedMidtrain requires understanding VLM4VLA, because the two are tightly coupled.
5. Operating Insights
Insert a Mid-Training Stage Between Your VLM and VLA Fine-Tuning — It Is Cheap and Consistently Beneficial
The operational ask here is modest: train a linear classifier for roughly 75–100 steps (Appendix A.1), use it to score and filter your VLM training pool down to ~1.2M samples, then run 5,000 steps of full-parameter fine-tuning before your normal VLA training pipeline. The payoff, closing the gap between a 1.1B model and 7–8B baselines, is substantial relative to the compute investment. The pipeline "requires no architectural changes to the underlying VLM or VLA" (Section 1), meaning it can be inserted into existing training workflows without redesign. For teams running repeated VLA fine-tuning experiments, this mid-training needs to be done only once per backbone, as the selected data transfers across fine-tuning targets.
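Written down as a plan, the recipe is four stages. The sketch below records them in a small data structure using the step counts and target size quoted in this section. The `Stage` class, the field names, and the pool size of 12 million are hypothetical; the summary does not state how large the candidate pool is.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Stage:
    name: str
    steps: Optional[int]  # None where this summary gives no step count
    updates: str          # which parameters the stage trains

PIPELINE = (
    # 100 is the upper end of the "roughly 75-100 steps" quoted above.
    Stage("train proximity estimator", 100, "linear classifier on frozen VLM features"),
    Stage("score and filter the VLM pool", None, "nothing (inference only)"),
    Stage("mid-train on the selected subset", 5000, "all VLM parameters"),
    Stage("VLA fine-tuning", None, "existing pipeline, unchanged"),
)

def keep_fraction(pool_size: int, target: int = 1_200_000) -> float:
    """Fraction of the scored pool to keep in order to hit the target size."""
    if pool_size <= 0:
        raise ValueError("pool_size must be positive")
    return min(1.0, target / pool_size)

HYPOTHETICAL_POOL = 12_000_000  # assumption: pool size is not given in this summary

for index, stage in enumerate(PIPELINE, start=1):
    steps = "n/a" if stage.steps is None else str(stage.steps)
    print(f"{index}. {stage.name} | steps: {steps} | trains: {stage.updates}")
print(f"keep fraction for a {HYPOTHETICAL_POOL:,}-sample pool: "
      f"{keep_fraction(HYPOTHETICAL_POOL):.2%}")
```

Only the first and third stages involve training that is new to an existing workflow, which is why the section describes the ask as modest.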
Spatial Reasoning Data Should Be Treated as a First-Class Training Asset, Not an Afterthought
The proximity estimator's preferences are explicit and actionable: RefSpatial and RoboPoint are retained at high rates, while pure text-based VQA is almost entirely discarded. The selected mixture ends up being 19.9% RoboPoint and 14.7% RefSpatial (Appendix A.6), despite these being smaller datasets. For teams constructing data pipelines for VLM pretraining or mid-training, this is a clear signal: invest in spatial referring, affordance prediction, and grounded QA datasets over general commonsense or text-heavy visual QA. The abstract summarizes the learned preference as "spatial reasoning over text-centric tasks".
Do Not Use Training Loss as Your Primary Signal for Initialization Quality
This is a counterintuitive but important operational warning: "training loss remains highly similar across the two initializations" despite substantial downstream performance differences (Section 6.2). Teams that checkpoint and evaluate only based on training loss metrics during VLA fine-tuning may fail to detect the value (or lack thereof) of their VLM initialization. Building in early-stage downstream task evaluations — the paper shows the gap is visible "from the earliest fine-tuning steps" — is a more reliable diagnostic for whether your VLM backbone is well-suited for your target tasks.
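One way to act on this is to log a downstream evaluation score next to training loss at every checkpoint and flag the checkpoints where the two signals disagree. The sketch below does that for two runs. The log values are made-up illustrative numbers, not results from the paper, and the function name and thresholds are hypothetical.

```python
def divergence_report(baseline, candidate, loss_tol=0.02, eval_min_gap=0.1):
    """Return checkpoints where loss looks the same but downstream eval does not.

    Each run maps step -> {"loss": float, "eval": float}.
    """
    flagged = []
    for step in sorted(set(baseline) & set(candidate)):
        loss_gap = abs(baseline[step]["loss"] - candidate[step]["loss"])
        eval_gap = candidate[step]["eval"] - baseline[step]["eval"]
        if loss_gap <= loss_tol and eval_gap >= eval_min_gap:
            flagged.append((step, round(loss_gap, 3), round(eval_gap, 3)))
    return flagged

# Illustrative logs only: similar loss curves, diverging task performance.
off_the_shelf = {
    1000: {"loss": 0.52, "eval": 1.0},
    2000: {"loss": 0.41, "eval": 1.8},
    3000: {"loss": 0.35, "eval": 2.4},
}
mid_trained = {
    1000: {"loss": 0.51, "eval": 1.3},
    2000: {"loss": 0.41, "eval": 2.3},
    3000: {"loss": 0.34, "eval": 3.1},
}

flagged = divergence_report(off_the_shelf, mid_trained)
for step, loss_gap, eval_gap in flagged:
    print(f"step {step}: loss gap {loss_gap}, eval gap {eval_gap}")
```

In this toy log every checkpoint is flagged and the evaluation gap grows with training, mirroring the pattern the section describes. A report like this is cheap to produce if a short evaluation pass already runs at each checkpoint.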
6. Overlooked Insights
The Selected Data Preserves Diversity — This Is What Makes It Actually Work
Buried in Appendix A.5 is a finding that explains why mid-training generalizes rather than overfits: the proximity-selected subset achieves a diversity score of 1.93, nearly matching the full general VLM pool (1.96) and far exceeding VLA data (1.26). "Rather than simply duplicating VLA-like patterns, it exposes the VLM to a wide variety of visual and linguistic contexts that are nonetheless relevant to downstream embodied tasks" (Appendix A.5). This is not obvious: one would expect proximity-based selection to cluster the data near VLA examples and collapse diversity. It does not. The implication for practitioners: you can run this selection without the typical data diversity vs. specialization tradeoff — you get both. This diversity preservation is likely the reason the selected data transfers across different VLM backbones (InternVL → Qwen3VL), a finding the main text notes but does not fully explain.
Mid-Training Selectively Degrades Some VLM Capabilities — With Specific, Predictable Patterns
Appendix A.7 contains a result that teams deploying VLMs for multi-task purposes need to see: after mid-training, BLINK drops from 43.45 to 40.45 and SpatialEval drops from 49.82 to 48.00, while VisuLogic improves (21.00 → 24.90) and 3DSRBench improves (47.87 → 49.51), with POPE nearly unchanged (Table 7). The paper frames this as "a selective shift... reorienting the model toward skills more relevant to embodied downstream adaptation" (Appendix A.7). For any team using a single VLM backbone for both robot control and general-purpose tasks (e.g., a robot that also handles customer queries or document understanding), this mid-training step may trade some general visual capability, as measured here by BLINK and SpatialEval, for embodied performance. This is a real deployment tradeoff that the abstract does not flag but the appendix makes clear.
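The size of each shift is easier to compare as a signed delta. The snippet below computes them from the before and after scores quoted above. POPE is left out because this summary gives no numbers for it.

```python
# (before, after) mid-training, as quoted above from Table 7.
scores = {
    "BLINK": (43.45, 40.45),
    "SpatialEval": (49.82, 48.00),
    "VisuLogic": (21.00, 24.90),
    "3DSRBench": (47.87, 49.51),
}

deltas = {name: round(after - before, 2) for name, (before, after) in scores.items()}
regressed = sorted(name for name, d in deltas.items() if d < 0)
improved = sorted(name for name, d in deltas.items() if d > 0)

for name, delta in deltas.items():
    print(f"{name:12s} {delta:+.2f}")
print("regressed:", regressed)
print("improved: ", improved)
```

The largest movements are BLINK (down 3.00) and VisuLogic (up 3.90), so the shift is roughly symmetric in magnitude and not a uniform loss.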