FurnitureVLA: Learning Long-Horizon Bimanual Furniture Assembly with Vision-Language-Action Model
1. Key Themes
Real-Scale Bimanual Furniture Assembly via VLAs
The paper demonstrates the first systematic application of Vision-Language-Action (VLA) models to real-scale, dual-arm furniture assembly. Unlike prior work that focused on "toy-scale settings or single-arm manipulation," this system tackles full-size IKEA furniture (LACK table, KALLAX shelf, IVAR chair) requiring up to 7 subtasks and 1,550 control steps. The authors state: "We introduce FurnitureVLA, the first systematic study of real-scale bimanual furniture assembly using Vision-Language-Action models (VLAs)."
Progress-Enhanced VLA for Long-Horizon Execution
To prevent the compounding errors that typically plague long-horizon robotic tasks, the authors introduce a VLA that jointly predicts actions and a continuous progress signal. This allows the system to automatically transition between subtasks without an external stage estimator. The paper notes: "We propose a progress-enhanced VLA that jointly predicts actions and a subtask progress signal, enabling stable long-horizon execution via automatic subtask transitions."
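The automatic transition described above can be sketched as a small control loop. This is a minimal illustration, not the authors' implementation: the `Policy` signature, the `SubtaskRunner` class, and the 0.95 switching threshold are all assumptions introduced here.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

# Hypothetical policy signature (an assumption, not the paper's API):
# (observation, subtask prompt) -> (action, progress in [0, 1]).
Policy = Callable[[object, str], Tuple[Sequence[float], float]]


@dataclass
class SubtaskRunner:
    subtasks: List[str]
    threshold: float = 0.95  # assumed switching threshold, not from the paper
    index: int = 0

    def done(self) -> bool:
        """True once every subtask has been completed."""
        return self.index >= len(self.subtasks)

    def step(self, policy: Policy, obs: object) -> Sequence[float]:
        """Query the policy for the current subtask and advance to the
        next subtask when the predicted progress crosses the threshold."""
        action, progress = policy(obs, self.subtasks[self.index])
        if progress >= self.threshold:
            self.index += 1
        return action
```

Because the progress value comes from the same forward pass as the action, no separate stage estimator is needed in this sketch; the caller simply loops on `step` until `done()` returns true.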
Systematic Study of Precision Design Factors
The authors conduct a focused study on how perception and control parameters affect assembly success. They found that temporal ensembling, action horizons, camera viewpoints, and image resolution materially impact performance. Specifically, "FurnitureVLA improves average simulation success from 48% to 80% compared to baselines across three furniture types, with an additional 21% gain from our design factor study."
2. Contrarian Perspectives
Monolithic VLA Finetuning is Insufficient for Long-Horizon Tasks
While the trend in Physical AI is to train large, monolithic policies on full demonstrations, this paper shows that the approach underperforms on complex, long-horizon assembly. Finetuning the base π0.5 model on full-length demonstrations without subtask decomposition yielded only a 48% average success rate, dropping to 11% on the KALLAX shelf. The authors argue: "directly finetuning a monolithic VLA policy on full-length bimanual demonstrations remains suboptimal, as current VLA models are most effective on short-horizon tasks with relaxed geometric precision requirements."
Discrete Progress Signals Fail; Continuous Signals are Required
A common approach to task segmentation is to use discrete labels for subtasks. The authors tested this and found that it fails completely (0% success across all three furniture types). They explain: "We attribute this to the visual similarity between states where a part is nearly assembled and assembled, making discrete transitions difficult to detect." A continuous progress signal from 0 to 1 is necessary to provide smooth, unambiguous supervision.
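To make the contrast concrete, the sketch below builds both kinds of training labels for a demonstration split into subtasks. It assumes a linear ramp in time for the continuous signal and a per-step stage index for the discrete one; the summary does not say how the paper computes either label, so both function names and both labeling rules are illustrative.

```python
from typing import List


def continuous_progress(segment_lengths: List[int]) -> List[float]:
    """Per-timestep progress that ramps from 0 to 1 within each subtask.

    Assumption: progress is linear in time within a subtask.
    """
    labels: List[float] = []
    for n in segment_lengths:
        if n == 1:
            labels.append(1.0)  # a one-step subtask is complete immediately
        else:
            labels.extend(t / (n - 1) for t in range(n))
    return labels


def discrete_stage_ids(segment_lengths: List[int]) -> List[int]:
    """Discrete alternative: every timestep carries only its subtask index."""
    labels: List[int] = []
    for stage, n in enumerate(segment_lengths):
        labels.extend([stage] * n)
    return labels
```

With the discrete labels, the last step of one subtask and the first step of the next differ by a full class even though the images may look almost identical. The ramp changes only slightly between neighboring steps within a subtask.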
3. Companies Identified
- Physical Intelligence: Creators of the π0.5 VLA backbone used in this research. Relevant because FurnitureVLA is built by finetuning π0.5, showing its adaptability to complex, long-horizon tasks. Quote: "We use π0.5 as the VLA backbone, finetuned for 40,000 steps on 8 NVIDIA L40S GPUs."
- Kinova: Manufacturer of the Gen3 7-DoF robot arms used in the dual-arm setup. Relevant as the hardware platform validating the sim-to-real transfer. Quote: "Assembly is performed on a tabletop using two Kinova Gen3 7-DoF robot arms."
- Robotiq: Provider of the grippers (Hand-E and 2F-85) used on the Kinova arms. Relevant for understanding the end-effector constraints in real-world assembly. Quote: "...with a Robotiq Hand-E gripper on the left arm and a Robotiq 2F-85 gripper on the right."
- Meta: Creators of the Quest 3 headset used for VR teleoperation. Relevant for single-operator bimanual control data collection. Quote: "The teleoperator wears a Meta Quest 3 headset at the neck to track hand poses."
- IKEA: The furniture brand whose products (LACK, KALLAX, IVAR) serve as the benchmark tasks. Relevant as a proxy for real-world consumer goods assembly. Quote: "We study real-scale bimanual furniture assembly using three IKEA items of increasing difficulty."
- NVIDIA: Providers of Isaac Gym simulation and L40S GPUs. Relevant as the compute and simulation infrastructure. Quote: "We use Isaac Gym... finetuned for 40,000 steps on 8 NVIDIA L40S GPUs."
4. People Identified
- Chenyang Ma: University of Oxford / Mitsubishi Electric Research Laboratories (MERL). Lead author, focusing on VLA models for long-horizon manipulation. Contribution: Co-authored the progress-enhanced VLA framework and subtask transition logic.
- Diego Romeres: Mitsubishi Electric Research Laboratories (MERL). Corresponding author and principal researcher. Relevant as a key figure in MERL's robotics and AI division, driving applied industrial robotics research. Contribution: Co-authored the system design and experimental validation.
- Chiori Hori: Mitsubishi Electric Research Laboratories (MERL). Corresponding author. Relevant for her expertise in multimodal AI and language-action models. Contribution: Co-authored the VLA architecture and inference mechanisms.
5. Operating Insights
Define Subtask Boundaries at Stable, Contact-Free States
When decomposing long-horizon tasks, do not segment at the moment of assembly completion. Contact-rich states are highly sensitive to small errors, which amplify across rollouts. Instead, segment after the robot retreats. The paper states: "We define subtask boundaries after retreat rather than immediately after assembly... By contrast, post-retreat states are free of contact and force constraints. Small errors are less likely to amplify, yielding a narrower, more consistent initial-state distribution."
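One way to apply this rule when labeling demonstrations is sketched below. The summary does not describe how the authors detect retreat, so the distance signal, the `assembled_at` index, and the 10 cm clearance are hypothetical choices made for illustration.

```python
from typing import Sequence


def boundary_after_retreat(
    gripper_to_part_m: Sequence[float],
    assembled_at: int,
    clearance_m: float = 0.10,  # assumed retreat distance, not from the paper
) -> int:
    """Return the first timestep at or after `assembled_at` where the gripper
    has moved at least `clearance_m` away from the part.

    Falls back to the last timestep if the gripper never retreats that far.
    """
    for t in range(assembled_at, len(gripper_to_part_m)):
        if gripper_to_part_m[t] >= clearance_m:
            return t
    return len(gripper_to_part_m) - 1
```

Cutting at the returned index rather than at `assembled_at` means the next subtask always starts from a contact-free state.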
Optimize Perception and Control Loops for Precision
Generalist policies often lack the precision for assembly. CTOs should tune temporal ensembling (averaging overlapping action predictions) and increase image resolution. The authors found: "temporal ensembling consistently improves assembly success, with λ = −0.1 performing best" and "higher image resolution consistently improves performance, highlighting the importance of visual precision in assembly." Specifically, raising the input image resolution from 224x224 to 448x448 was crucial.
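Temporal ensembling can be sketched as a weighted average over every prediction made for the current timestep. The weighting convention below is an assumption: each prediction gets weight exp(λ · age), where age is the number of steps since it was made, so λ = −0.1 favors newer predictions. The paper's exact formula may differ, and `ensemble_action` is a name introduced here.

```python
import math
from typing import List, Sequence


def ensemble_action(
    predictions: Sequence[Sequence[float]], lam: float = -0.1
) -> List[float]:
    """Blend overlapping predictions for the current timestep.

    `predictions` is ordered oldest first, newest last. Each prediction is
    weighted by exp(lam * age) (assumed convention), then weights are
    normalized to sum to 1.
    """
    count = len(predictions)
    weights = [math.exp(lam * (count - 1 - i)) for i in range(count)]
    total = sum(weights)
    dims = len(predictions[0])
    return [
        sum(w * p[d] for w, p in zip(weights, predictions)) / total
        for d in range(dims)
    ]
```

Setting `lam=0.0` reduces this to a plain average, and a more negative `lam` tracks the newest prediction more closely at the cost of less smoothing.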
Design VR Teleoperation for Bimanual Efficiency
For data collection, decouple translation and rotation control, use predefined grasp primitives (snapping to 90° orientations), and implement synchronized bimanual control. This reduces operator burden and ensures high-quality demonstrations. The authors note: "We introduce a synchronized mode in which both arms execute mirrored commands simultaneously, enabling efficient repositioning, alignment, and rotation of large and heavy furniture components."
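Two of these teleoperation aids are simple enough to sketch. The code assumes that "mirrored" means flipping the lateral (y) component of a translation command and that snapping applies to a single angle in degrees. Neither detail is specified in the summary, and all function names are hypothetical.

```python
from typing import Tuple

Vec3 = Tuple[float, float, float]


def snap_to_right_angle(angle_deg: float) -> float:
    """Snap a commanded angle to the nearest multiple of 90 degrees."""
    return 90.0 * round(angle_deg / 90.0)


def mirrored_command(delta: Vec3) -> Vec3:
    """Mirror a translation delta for the second arm.

    Assumption: the arms sit on opposite sides of the x-z plane, so
    mirroring flips only the lateral (y) component.
    """
    dx, dy, dz = delta
    return (dx, -dy, dz)


def synchronized_step(delta: Vec3) -> Tuple[Vec3, Vec3]:
    """Return (left-arm command, right-arm command) for synchronized mode."""
    return delta, mirrored_command(delta)
```

In this sketch a single hand motion drives both arms, which is what lets one operator reposition a large panel without coordinating two controllers.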
6. Overlooked Insights
Simulation Workarounds for Rigid Body Constraints
The paper reveals a significant limitation of the Isaac Gym simulator: the inability to dynamically weld parts together at runtime. To simulate the IVAR chair assembly, the authors had to split the task into two stages, manually respawning the partially assembled frame as a single rigid mesh. Quote: "Isaac Gym does not support runtime weld constraints... We therefore split the IVAR chair episode into two stages... The staging is purely an engineering workaround for the absence of weld constraints." This implies that sim-to-real pipelines for complex assembly still require heavy custom engineering.
Real-World Precision Tolerances are Extremely Tight
Even when bypassing screws with magnets, the physical tolerances for success are brutal. Parts misaligned by 1 cm can still snap together, but a 1.5 cm deviation or a 10° tilt causes them to fall apart. Furthermore, gripper clearance is minimal (5 cm gripper opening vs. 3.7 cm part thickness), and up to 8 magnets must be aligned simultaneously. Quote: "a 1 cm deviation can snap successfully, but 1.5 cm or a 10° tilt causes parts to fall apart." This highlights that "real-scale" assembly demands sub-centimeter precision even in simplified setups.
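The gripper clearance figure can be worked out directly from the two numbers quoted above. The short calculation below assumes the part is centered between the fingers; the function name is illustrative.

```python
from typing import Tuple

GRIPPER_OPENING_CM = 5.0  # reported gripper opening
PART_THICKNESS_CM = 3.7   # reported part thickness


def grasp_clearance_cm(opening_cm: float, thickness_cm: float) -> Tuple[float, float]:
    """Return (total clearance, clearance per side) in centimeters.

    The per-side value assumes the part is centered between the fingers.
    """
    total = opening_cm - thickness_cm
    return total, total / 2.0


total, per_side = grasp_clearance_cm(GRIPPER_OPENING_CM, PART_THICKNESS_CM)
print(f"total: {total:.2f} cm, per side: {per_side:.2f} cm")
```

The total margin is 1.3 cm, or 0.65 cm on each side of a centered part. That is smaller than the 1 cm misalignment the magnets themselves tolerate.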