LAMP: Lift Image-Editi… | arXiv Physical AI Research Summary

Why should someone building or funding robots care? LAMP targets one of the most persistent failures in open-world robot deployment: robots that cannot work out exactly where to put things in 3D space from a language instruction alone. It achieves 66% zero-shot success across 10 diverse manipulation tasks, roughly 2.75x the next-best baseline (ReKep, 24%), using nothing but a standard RGB-D camera and off-the-shelf models.

1. Key Themes

Using Image Editing as a Spatial Reasoning Engine

The paper's central insight is that modern image-editing models (like Gemini 2.5 Flash and Qwen-Image-Edit) already implicitly "know" how objects should move and relate to each other in a scene. Rather than training a robot on manipulation data, LAMP queries an image editor with a task instruction ("insert the toast into the toaster"), gets a picture of the desired end state, then uses that picture to compute a precise 3D transformation. As the authors state: "image-editing models implicitly encode rich spatial priors in the 2D visual domain: how an object should move, rotate, or interact within a scene" (Section 1). This is a zero-shot approach — no robot training data required.
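The data flow described above can be sketched as three stages wired together. This is only an illustration of the order of operations: the class name and the three callables (`edit_image`, `lift_to_3d`, `register`) are hypothetical stand-ins, not LAMP's actual interfaces, and the sketch glosses over how depth from the RGB-D camera enters the lifting step.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Matrix4 = List[List[float]]  # 4x4 homogeneous transform, row-major


@dataclass
class GoalPlanner:
    # All three callables are hypothetical placeholders for real components.
    edit_image: Callable[[bytes, str], bytes]      # (rgb, instruction) -> goal image
    lift_to_3d: Callable[[bytes], object]          # image -> 3D representation
    register: Callable[[object, object], Matrix4]  # (current, goal) -> relative transform

    def plan(self, rgb: bytes, instruction: str) -> Tuple[bytes, Matrix4]:
        goal_rgb = self.edit_image(rgb, instruction)   # 1. imagine the end state in 2D
        current = self.lift_to_3d(rgb)                 # 2. lift both states to 3D
        goal = self.lift_to_3d(goal_rgb)
        return goal_rgb, self.register(current, goal)  # 3. recover the 6-DoF motion
```

Passing the stages in as callables keeps the editing backend swappable, which matters given how much the choice of editor affects results.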

Language and VLMs Are Fundamentally Broken for Fine-Grained Manipulation

The paper provides a damning technical diagnosis of why LLM/VLM-based manipulation fails. The core problem isn't intelligence — it's geometric resolution. "Both LLM- and VLM-based previous approaches ultimately rely on language-described explicit constraints that are inherently sparse and ambiguous in 3D space. They struggle to express fine-grained geometric relations, such as relative rotations, contact geometry or precise alignment between interacting objects" (Section 1). This isn't a data problem that scale will fix — it's a representational limitation of language itself.

3D Registration as the Missing Middle Layer

LAMP introduces a pipeline layer that most robotics stacks are missing: a structured way to convert a "goal image" into a precise 6-DoF transformation (rotation and translation in 3D space) of one object relative to another. This cross-state point cloud registration step, combined with hierarchical noise filtering and unified scale alignment, converts what would otherwise be a fuzzy visual subgoal into a precise geometric target. The result: "LAMP achieves lower translation and rotation RMSE despite not relying on mesh" compared to supervised baselines trained specifically for those tasks (Table 1, Section 4.1).
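To make "converting two object states into a 6-DoF transformation" concrete, here is a minimal sketch of the classic Kabsch least-squares fit. It assumes point correspondences between the current and goal clouds are already known and noise-free, which is a strong simplification; the paper's cross-state registration, filtering, and scale alignment are not reproduced here, and the function name is an illustrative choice.

```python
import numpy as np


def estimate_rigid_transform(src: np.ndarray, dst: np.ndarray):
    """Least-squares R, t such that R @ src[i] + t ~= dst[i] (Kabsch algorithm).

    src, dst: (N, 3) arrays of corresponding points, N >= 3 and not collinear.
    """
    src_centroid = src.mean(axis=0)
    dst_centroid = dst.mean(axis=0)
    # Cross-covariance of the centered point sets.
    H = (src - src_centroid).T @ (dst - dst_centroid)
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so the result is a rotation, not a reflection.
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_centroid - R @ src_centroid
    return R, t


# Usage: recover a known 30-degree rotation about z plus a translation.
rng = np.random.default_rng(0)
src = rng.uniform(-0.05, 0.05, size=(50, 3))
theta = np.deg2rad(30.0)
R_true = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta),  np.cos(theta), 0.0],
                   [0.0, 0.0, 1.0]])
t_true = np.array([0.10, -0.02, 0.05])
dst = src @ R_true.T + t_true
R_est, t_est = estimate_rigid_transform(src, dst)
```

With real sensor data the correspondences are unknown and noisy, which is exactly why the filtering and scale steps discussed elsewhere in this summary matter.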

Deployment-Ready Performance Gap

The performance gap against comparable zero-shot systems is not marginal; it is operational. Across 10 diverse manipulation tasks, LAMP achieved 66.0% overall success vs. 24.0% for ReKep, 13.0% for CoPa, and 4.0% for VoxPoser (Table 2, Section 4.2). On precision tasks like coin insertion and toast insertion, competing methods scored 0/10 while LAMP scored 5/10 and 6/10. For a systems builder, the difference between 0% and 50-60% is the difference between a demo and a product.

Scale Consistency Is the Unsung Killer of Precision Manipulation

The paper identifies a subtle but catastrophic failure mode: when you lift a 2D edited image into 3D, the active object (what's moving) and passive object (what it's interacting with) can end up at inconsistent scales, causing spatial offsets that ruin precision tasks. "Even a 1% scale error can result in significant translation offsets that are catastrophic for fine-grained manipulation" (Section 12). LAMP enforces unified scale across both objects by using the passive object as a reference anchor — a specific engineering decision with outsized impact on real-world reliability.
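A quick back-of-the-envelope check shows why a 1% scale error matters. The sketch below assumes the scale error acts about the camera origin (so a point at distance d shifts by roughly error × d) and uses an illustrative 0.6 m working distance that is not taken from the paper. The second function is one simple, hypothetical way to derive a single shared scale from the passive object; it is not LAMP's actual alignment procedure.

```python
import numpy as np


def offset_from_scale_error(point_m, scale_error):
    """Displacement (meters) of a point when its cloud is scaled by
    (1 + scale_error) about the camera origin."""
    p = np.asarray(point_m, dtype=float)
    return float(np.linalg.norm((1.0 + scale_error) * p - p))


def unified_scale(passive_pred, passive_ref):
    """One scale factor mapping the predicted passive-object cloud onto the
    reference one, from the ratio of RMS spreads about each centroid.
    Assumes both clouds sample the same object surface."""
    def spread(pts):
        pts = np.asarray(pts, dtype=float)
        return np.sqrt(np.mean(np.sum((pts - pts.mean(axis=0)) ** 2, axis=1)))
    return float(spread(passive_ref) / spread(passive_pred))


# An object 0.6 m from the camera with a 1% scale error moves by about 6 mm.
offset_m = offset_from_scale_error([0.0, 0.0, 0.6], 0.01)
```

Six millimeters is already more than the clearance in many slot-insertion setups, which is consistent with the paper's warning. Applying the same factor from `unified_scale` to both the active and passive clouds keeps their relative geometry consistent.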

2. Contrarian Perspectives

More Robot Data Won't Fix the Core Problem

The dominant industry hypothesis is that foundation models for robotics (VLAs) just need more data and compute. LAMP challenges this directly. The paper argues the failure isn't data volume — it's that the underlying representation (language tokens, 2D keypoints) cannot express continuous 3D geometry: "The core limitation stems from the discrete and symbolic nature of language, which makes it hard to capture continuous 3D spatial interactions" (Section 1). Under this view, scaling RT-2, OpenVLA, or π0 won't produce reliable coin insertions or lid coverings — the architecture is wrong for the problem, not underfueled.

Video Generation Models Are Worse Than Image Editors for Manipulation

There's significant industry investment in video generation as a path to robot world models (see Kling, Veo, Sora). LAMP directly tested this and found image editors outperform video generators as spatial priors. "Edited-image priors exhibit stronger adherence to semantic constraints and better background consistency, resulting in more reliable and coherent long-horizon manipulation" compared to Kling 1.6 and Veo 3 (Section 11.1, Figure 20). The reasoning: video models introduce temporal inconsistency and hallucinate physical dynamics, while image editors preserve scene geometry and subject identity more faithfully. Companies building robot planning layers on top of video generation should take note.

The Closest Competitor Made a Fundamental Architecture Error

GoalVLA, a concurrent paper with a similar approach, achieved dramatically worse results because it didn't enforce scale consistency between objects: GoalVLA scored 3/10 on lid covering and 1/10 on pencil insertion vs. LAMP's 8/10 and 7/10 (Table 6, Section 12). LAMP's conclusion is that this isn't a tuning issue — "their neglect of scale consistency between active and passive objects throughout the pipeline results in significant spatial offsets" (Table 6 caption). This implies that multiple teams are now building image-editing-to-robot pipelines, but most will fail on precision tasks unless they solve scale alignment correctly.

3. Companies Identified

UFACTORY

Description: Chinese robotics hardware company making the xArm series of collaborative robot arms
Why relevant: LAMP's entire real-world experimental validation runs on the UFACTORY xArm7 with the xArm Gripper G2 — making this the reference hardware platform for the paper's claims
Quote: "Our experiments are conducted on a UFACTORY xArm7 robotic arm equipped with its UFACTORY xArm Gripper G2" (Section 4.2)

Intel (RealSense)

Description: Intel's depth camera product line
Why relevant: The Intel RealSense D435i is the only sensing hardware used — meaning the entire system is designed around commodity RGB-D cameras, not structured-light arrays or lidar
Quote: "An Intel RealSense D435i RGB-D camera is mounted opposite the robot to capture a third-person view of the workspace" (Section 4.2)

Google DeepMind (Gemini)

Description: Google's frontier AI lab and model suite
Why relevant: Gemini 2.5 Flash is one of the two image-editing backends tested, and it outperforms Qwen-Image-Edit on subject consistency, making Google's image-editing quality directly relevant to robot manipulation performance
Quote: "Gemini 2.5 Flash demonstrates stronger subject consistency and better adherence to semantic constraints" (Section 11.2). Section 3.2 names the primary editing model as "Gemini 2.5 Flash Image (Nano Banana)"

Alibaba (Qwen)

Description: Alibaba's large model division
Why relevant: Qwen-Image-Edit is the second editing backend tested and performs better on certain tool-use scenarios (ring stacking, toast cutting) while failing on others — the choice of editing model materially affects task success rates
Quote: "QWen-Image-Edit...struggles with understanding directional relationships (e.g., in Candle insertion) and shows limited scene awareness (e.g., Toast insertion)" (Section 11.2)

Physical Intelligence (π0)

Description: Robot foundation model company founded by Sergey Levine, Chelsea Finn, et al.
Why relevant: π0.5 is cited as a representative VLA that LAMP's approach implicitly challenges — the paper argues VLAs "struggle to handle novel tasks and environments that are entirely different, falling short in open-world manipulation"
Quote: Referenced in §1 and §2 as part of the VLA category that LAMP positions against

InSpatio Research

Description: Research spinout affiliated with Zhejiang University (co-affiliation of corresponding author Guofeng Zhang)
Why relevant: Institutional co-author — this paper is emerging from an industry-adjacent lab, not purely academic; indicates potential commercialization pathway
Quote: Listed as affiliation 2 in author list

4. People Identified

Jingjing Wang

Lab/Institution: State Key Lab of CAD&CG, Zhejiang University
Why notable: Lead author; primary architect of the LAMP pipeline including the cross-state registration and hierarchical filtering systems
Quote: First author (author list). LAMP relies on VGGT (cited as [78]) for depth estimation; VGGT itself is the work of Jianyuan Wang et al., not this author

Guofeng Zhang

Lab/Institution: Zhejiang University / InSpatio Research (dual affiliation, corresponding author)
Why notable: Senior PI with industry ties; the InSpatio affiliation suggests active technology transfer interest; his lab has published extensively on 3D reconstruction and spatial computing
Quote: Corresponding author, dual-affiliated with InSpatio Research (Section header)

Yuke Zhu

Lab/Institution: Listed as co-author, Zhejiang University affiliation in this paper (note: there is a prominent Yuke Zhu at UT Austin known for robot learning research)
Why notable: Mid-author on the paper; if this is the UT Austin Yuke Zhu, this cross-institutional collaboration signals broad network credibility in the robot learning community
Quote: Listed as co-author (author list)

Chong Bao

Lab/Institution: State Key Lab of CAD&CG, Zhejiang University
Why notable: Co-author with likely expertise in 3D vision given the lab's focus; contributes to the geometric reasoning core of the system
Quote: Listed as co-author (author list)

5. Operating Insights

The Editing Model Is Now a Critical Infrastructure Choice

For any team building a LAMP-like pipeline, the choice of image-editing backend is not an implementation detail; it is a key determinant of task success. Qwen-Image-Edit and Gemini had opposite failure modes: Qwen-Image-Edit fails on directional reasoning and scene awareness, while Gemini fails on tool-use tasks. The ablation in Table 5 shows swings of 7/10 vs. 1/10 on the same task depending on which model is used. A CTO deploying this approach needs a model selection and fallback strategy per task category, not a single model. The paper also notes that "image editing does not always remove the active object from its original location" (Section 11.2), so you need validation logic to check whether the edit actually encoded the right information before executing.
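One way to structure per-category selection with a validation gate is sketched below. Everything here is an assumption for illustration: the category names, the preference orders in `ROUTES`, and the `Editor` and `Validator` signatures are hypothetical, and the actual validity check (for example, confirming the active object left its original location) is deliberately left as a pluggable callable.

```python
from typing import Callable, Dict, List, Optional, Tuple

Editor = Callable[[bytes, str], bytes]      # (image, instruction) -> edited image
Validator = Callable[[bytes, bytes], bool]  # (original, edited) -> accept the edit?

# Hypothetical preference order per task category (illustrative only).
ROUTES: Dict[str, List[str]] = {
    "tool_use": ["qwen", "gemini"],
    "insertion": ["gemini", "qwen"],
}


def edit_with_fallback(
    image: bytes,
    instruction: str,
    category: str,
    editors: Dict[str, Editor],
    is_valid: Validator,
    default_order: Tuple[str, ...] = ("gemini", "qwen"),
) -> Optional[Tuple[str, bytes]]:
    """Try backends in the category's preferred order; return the first
    (backend_name, edited_image) that passes validation, else None."""
    for name in ROUTES.get(category, list(default_order)):
        editor = editors.get(name)
        if editor is None:
            continue
        try:
            edited = editor(image, instruction)
        except Exception:
            continue  # treat a backend error like a rejected edit
        if is_valid(image, edited):
            return name, edited
    return None
```

Returning `None` when every backend is rejected lets the caller abort before any robot motion is planned, rather than executing against a bad goal image.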

The Biggest System Bottleneck Is the Image Editing Query, Not Robot Execution

Runtime profiling (Figure 15, Section 9) shows that the image editing API call dominates total latency — the perception and registration components are relatively fast. This has direct implications for deployment architecture: teams should pre-cache edited goal images for recurring task templates, implement async query pipelines, and evaluate on-premise model hosting vs. API calls for latency-sensitive applications. The system follows a "think-before-act" paradigm where heavy computation happens outside the control loop — a design pattern worth adopting broadly.
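The caching and async recommendations could look roughly like the sketch below. It assumes a hypothetical async `edit_fn(image, instruction)` wrapping whatever editing API is in use, and it keys the cache on a caller-supplied `template_id` plus the instruction, on the assumption that recurring task templates share a reusable goal image. Neither the class nor the keying scheme comes from the paper.

```python
import asyncio


class GoalImageCache:
    """Caches edited goal images per (template_id, instruction) and
    deduplicates concurrent requests for the same key."""

    def __init__(self, edit_fn):
        self._edit_fn = edit_fn  # hypothetical: async (image, instruction) -> bytes
        self._tasks = {}

    async def get(self, template_id: str, image: bytes, instruction: str) -> bytes:
        key = (template_id, instruction)
        if key not in self._tasks:
            # Store the in-flight task so later callers await the same request.
            self._tasks[key] = asyncio.ensure_future(self._edit_fn(image, instruction))
        return await self._tasks[key]


async def demo():
    calls = []

    async def fake_edit(image, instruction):  # stand-in for a slow API call
        calls.append(instruction)
        await asyncio.sleep(0)
        return b"edited:" + image

    cache = GoalImageCache(fake_edit)
    first, second = await asyncio.gather(
        cache.get("toast", b"img", "insert the toast into the toaster"),
        cache.get("toast", b"img", "insert the toast into the toaster"),
    )
    return first, second, len(calls)
```

Note that a failed request stays cached in this sketch; a production version would evict failed tasks and add expiry, since a goal image is only reusable while the scene layout matches the template.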

Point Cloud Noise Handling Must Be Task-Aware, Not Generic

Standard density-based filtering (DBSCAN) fails on robot manipulation point clouds because sensor artifacts ("flying edge points") are spatially close to valid geometry. LAMP's ablation showed that removing their hierarchical 2D-3D fused filter dropped success from 6/10 to 2/10 at the most challenging viewpoint (0° ring stacking, Table 3). The fix requires combining visual features (DINOv3) with spatial clustering — a two-stage approach that most robotics perception pipelines don't implement. "While these flying-edge points are spatially adjacent to valid points, they are far from inliers with similar visual features" (Section 3.3). Engineering teams should audit their point cloud filtering stacks against this failure mode.
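The quoted observation suggests a simple test: a point is suspicious if the points that look most like it are far away from it in 3D. The sketch below implements that idea with brute-force nearest neighbors in feature space. It is an interpretation for illustration, not the paper's hierarchical filter; it assumes per-point feature vectors are already available (for example, image features projected onto the cloud, a step not shown), and `k` and `max_dist` are arbitrary illustrative values.

```python
import numpy as np


def filter_flying_points(xyz, feats, k=5, max_dist=0.05):
    """Return a boolean keep-mask over N points.

    A point is kept if the median 3D distance to its k most feature-similar
    points is at most max_dist (meters). xyz: (N, 3), feats: (N, D).
    Brute force: O(N^2) memory, fine for small clouds only.
    """
    f = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    sim = f @ f.T                                  # cosine similarity
    np.fill_diagonal(sim, -np.inf)                 # exclude self-matches
    nn = np.argsort(-sim, axis=1)[:, :k]           # (N, k) neighbor indices
    dists = np.linalg.norm(xyz[:, None, :] - xyz[nn], axis=2)  # (N, k)
    return np.median(dists, axis=1) <= max_dist


# Toy scene: an object at 0.5 m, a background at 0.8 m, and one flying-edge
# point that looks like the object but floats between the two surfaces.
rng = np.random.default_rng(0)
obj_xyz = np.array([0.0, 0.0, 0.5]) + rng.uniform(-0.01, 0.01, size=(30, 3))
bg_xyz = np.array([0.0, 0.0, 0.8]) + rng.uniform(-0.01, 0.01, size=(30, 3))
obj_feat = np.array([1.0, 0.0]) + rng.uniform(-0.05, 0.05, size=(30, 2))
bg_feat = np.array([0.0, 1.0]) + rng.uniform(-0.05, 0.05, size=(30, 2))
xyz = np.vstack([obj_xyz, bg_xyz, [[0.0, 0.0, 0.65]]])
feats = np.vstack([obj_feat, bg_feat, [[1.0, 0.0]]])
keep = filter_flying_points(xyz, feats)
```

In this toy scene the floating point is dropped because its look-alikes all sit about 15 cm away, while the object and background points survive. A purely spatial filter has no such signal to work with when the artifact sits close to valid geometry.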

6. Overlooked Insights

The System Failure Breakdown Reveals Where Robot AI Actually Breaks

Section 10 and Figure 16 contain an honest post-mortem that most papers bury or omit. The majority of LAMP's failures come from the image editing module — "unintended modifications to task-irrelevant scene elements or a failure to reflect the requested edits" — not from perception, registration, or motion execution. The low-level controller contributed only a minimal fraction of failures. This is a striking finding: once you have good geometry, robot arm execution is largely a solved problem. The unsolved problem is getting foundation models to reliably produce geometrically coherent goal states. Investors evaluating "robot AI" companies should ask specifically how failure rates break down — if the answer is dominated by perception or LLM/VLM failures rather than control failures, that's the actual product risk.

This Approach Is Currently Hard-Limited to Rigid Objects

LAMP explicitly cannot handle deformable or soft-body objects: "LAMP currently handles rigid-body interactions and does not address soft-body or deformable-object manipulation" (Section 5). This is not a minor caveat — it excludes cloth folding, food handling, cable routing, and biological material manipulation, which are among the highest-value targets in warehouse, food service, and healthcare robotics. The approach relies on SE(3) transformations (rigid rotations and translations), which break down when objects deform. Any company evaluating LAMP-style architectures for non-rigid manipulation scenarios should treat this as a hard architectural constraint requiring a separate solution track.
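The rigid-body assumption can be stated precisely: an SE(3) transformation preserves every pairwise distance within an object, and a deformation does not. The sketch below checks that property for two states of the same points with known correspondences. The function names and tolerance are illustrative, and note that distance preservation alone would also admit mirror reflections.

```python
import numpy as np


def pairwise_distances(pts):
    """(N, N) matrix of Euclidean distances between all point pairs."""
    diff = pts[:, None, :] - pts[None, :, :]
    return np.linalg.norm(diff, axis=2)


def is_rigid_motion(before, after, tol=1e-3):
    """True if corresponding points keep all pairwise distances within tol (meters)."""
    change = np.abs(pairwise_distances(before) - pairwise_distances(after))
    return bool(np.max(change) <= tol)


before = np.array([[0.0, 0.0, 0.0], [0.1, 0.0, 0.0], [0.0, 0.1, 0.0], [0.0, 0.0, 0.1]])

# Rigid case: rotate 45 degrees about x, then translate.
a = np.deg2rad(45.0)
R = np.array([[1.0, 0.0, 0.0],
              [0.0, np.cos(a), -np.sin(a)],
              [0.0, np.sin(a),  np.cos(a)]])
moved = before @ R.T + np.array([0.2, 0.0, 0.1])

# Deformed case: stretch 20% along x, as a crude stand-in for a soft object.
stretched = before * np.array([1.2, 1.0, 1.0])
```

Here `is_rigid_motion(before, moved)` holds and `is_rigid_motion(before, stretched)` does not, which is the sense in which a single rotation plus translation cannot describe cloth, cables, or food.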