TiPToP: A Modular Open-Vocabulary Planning System for Robotic Manipulation
- 01 Zero-Shot Planning Beats Fine-Tuned VLAs on Complex Tasks
- 02 Modular Architecture Enables Debuggable, Upgradable Systems
- 03 Speed Advantage from Pre-Planned Open-Loop Execution
- 04 Cross-Embodiment Deployment in Hours, Not Weeks
- 05 Complementary Failure Profiles Suggest a Hybrid Architecture Roadmap
MIT CSAIL + University of Pennsylvania | arXiv 2603.09971 | March 2026
1. Key Themes
Zero-Shot Planning Beats Fine-Tuned VLAs on Complex Tasks
The central result is striking: TiPToP, which requires zero robot training data, matches or outperforms π₀.₅-DROID — a state-of-the-art VLA fine-tuned on 350 hours of embodiment-specific demonstrations — across 28 tabletop manipulation tasks. The performance gap widens as task complexity increases. On distractor tasks (finding the right object among clutter), TiPToP achieves 60% vs. 26.7% success rate. On semantic tasks (e.g., "pick the largest toy," "sort blocks by color"), TiPToP achieves 65% vs. 25% success rate. On multi-step tasks, TiPToP achieves 57.5% vs. 15% success rate.
"TiPToP achieves comparable or better success rates across diverse tasks, and we find that the two systems fail in complementary ways." (Section I)
"On multi-step tasks, TiPToP achieves a higher success rate on six of seven scenes, with the largest difference in the simulated color cubes scene (9/10 vs. 0/10)." (Section VII-B)
Modular Architecture Enables Debuggable, Upgradable Systems
TiPToP is built as three swappable components: a perception module (FoundationStereo + M2T2 + Gemini + SAM-2), a planning module (GPU-parallelized TAMP via cuTAMP), and an execution module (joint impedance controller). Each component can be replaced independently as better models emerge. Critically, this modularity enables component-level failure analysis, which end-to-end VLAs do not readily provide. When the system fails, engineers can trace the failure to a specific module.
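To make the "swappable components" idea concrete, here is a minimal sketch of a three-stage pipeline in which each stage sits behind a small interface and any failure is tagged with the stage that raised it. All class and function names (`Scene`, `StageError`, `run_pipeline`, the stub classes) are hypothetical illustrations and are not taken from the TiPToP codebase.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class Scene:
    """Hypothetical perception output: object labels plus a symbolic goal."""
    objects: list[str]
    goal: str


class Perception(Protocol):
    def perceive(self, instruction: str) -> Scene: ...


class Planner(Protocol):
    def plan(self, scene: Scene) -> list[str]: ...


class Executor(Protocol):
    def execute(self, trajectory: list[str]) -> bool: ...


class StageError(RuntimeError):
    """Wraps an exception together with the name of the stage that raised it."""

    def __init__(self, stage: str, cause: Exception):
        super().__init__(f"{stage} failed: {cause}")
        self.stage = stage


def _run_stage(stage, fn, *args):
    # Run one stage and attribute any exception to that stage.
    try:
        return fn(*args)
    except Exception as exc:
        raise StageError(stage, exc) from exc


def run_pipeline(perception: Perception, planner: Planner,
                 executor: Executor, instruction: str) -> bool:
    scene = _run_stage("perception", perception.perceive, instruction)
    trajectory = _run_stage("planning", planner.plan, scene)
    return _run_stage("execution", executor.execute, trajectory)


# Toy stand-ins for the real modules (assumptions, for illustration only).
class StubPerception:
    def perceive(self, instruction):
        return Scene(objects=["can", "mug"], goal="In(can, mug)")


class StubPlanner:
    def plan(self, scene):
        obj, target = scene.objects
        return [f"pick {obj}", f"place {obj} in {target}"]


class BrokenPlanner:
    def plan(self, scene):
        raise ValueError("no feasible plan")


class StubExecutor:
    def execute(self, trajectory):
        return len(trajectory) > 0


if __name__ == "__main__":
    print(run_pipeline(StubPerception(), StubPlanner(), StubExecutor(),
                       "put the can in the mug"))
    try:
        run_pipeline(StubPerception(), BrokenPlanner(), StubExecutor(),
                     "put the can in the mug")
    except StageError as err:
        print("failed stage:", err.stage)
```

Swapping `StubPlanner` for `BrokenPlanner` changes nothing else in the pipeline, and the resulting error names the planning stage. That is the property the component-level failure analysis relies on.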
"TiPToP's modular architecture enables us to analyze the system's failure modes at the component level. We analyze results from an evaluation of 173 trials and identify directions for improvement." (Abstract)
"Grasping failures (31/55 failures) are the most common failure mode... Scene completion errors (13/55 failures) result from incorrect mesh approximations... VLM errors (6/55 failures)... cuTAMP failures (5/55 failures)." (Section VII-E)
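The percentages used elsewhere in this brief follow directly from the counts in the quote above. A short sketch of the arithmetic, using only the four counts quoted from Section VII-E (the dictionary keys are shorthand labels, not the paper's terminology):

```python
# Failure counts as quoted from Section VII-E of the paper.
failure_counts = {
    "grasping": 31,
    "scene completion": 13,
    "VLM": 6,
    "cuTAMP": 5,
}


def failure_shares(counts):
    """Return each category's share of all failures, in percent."""
    total = sum(counts.values())
    return {name: 100.0 * n / total for name, n in counts.items()}


shares = failure_shares(failure_counts)
for name, pct in shares.items():
    print(f"{name}: {pct:.1f}%")
# grasping: 56.4%
# scene completion: 23.6%
# VLM: 10.9%
# cuTAMP: 9.1%
```

The four categories sum to the 55 failures reported, so the shares cover the full failure set.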
Speed Advantage from Pre-Planned Open-Loop Execution
TiPToP is consistently faster at task completion than a reactive VLA. On single-step real-world tasks, TiPToP completes in ~15 seconds vs. ~32 seconds for π₀.₅-DROID, roughly 2x faster. The speed advantage is most pronounced when the VLA struggles (e.g., can-to-mug: 18.6s vs. 41.0s, about 2.2x). The insight is that planning a full trajectory upfront and executing it open-loop avoids the idle time and retry cycles the authors observed in the reactive policy.
"On single-step real-world tasks, TiPToP completes execution in around 15 seconds, roughly half the time of π₀.₅-DROID... we observed qualitatively that π₀.₅-DROID spends a significant amount of time idling and seemingly not making any task progress." (Section VII-B)
Cross-Embodiment Deployment in Hours, Not Weeks
The system was adapted to a UR5e arm in "approximately 2–3 hours" and deployed on a Trossen WidowX AI in collaboration with an independent researcher. The deployment checklist is explicit and minimal: robot URDF, collision spheres, cuRobo config file, camera interface, and controller interface. This is a direct challenge to the narrative that robotics systems require weeks of platform-specific integration work.
"It can be installed and run on a standard DROID setup in under one hour and adapted to new embodiments with minimal effort." (Abstract)
"Given an existing robot controller, we completed all changes in approximately 2–3 hours." (Appendix D)
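The five-item deployment checklist described above can be tracked as a simple record that reports which artifacts are still missing for a new embodiment. This is a hypothetical sketch: the `EmbodimentPort` class, its field names, and the example file names are assumptions for illustration and do not reflect TiPToP's actual configuration format.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class EmbodimentPort:
    """Hypothetical record of the five checklist artifacts for a new robot."""
    robot_urdf: Optional[str] = None            # path to the robot URDF
    collision_spheres: Optional[str] = None     # path to a collision-sphere description
    curobo_config: Optional[str] = None         # path to a cuRobo config file
    camera_interface: Optional[str] = None      # name of the camera driver wrapper
    controller_interface: Optional[str] = None  # name of the controller wrapper

    def missing(self) -> list[str]:
        """List the checklist items that have not been provided yet."""
        return [f.name for f in fields(self) if getattr(self, f.name) is None]


# Example with made-up file names: two of five items are done.
port = EmbodimentPort(robot_urdf="ur5e.urdf", curobo_config="ur5e.yml")
print(port.missing())
# ['collision_spheres', 'camera_interface', 'controller_interface']
```

An empty list from `missing()` would indicate that every item on the checklist has an entry. It says nothing about whether each artifact is correct.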
Complementary Failure Profiles Suggest a Hybrid Architecture Roadmap
The paper's most strategically interesting finding is not who wins, but how each system fails. TiPToP fails when grasps slip mid-execution (it has no recovery mechanism) and when objects have non-convex geometry that a convex mesh approximates poorly (e.g., a banana). π₀.₅-DROID fails at multi-step reasoning, distractor rejection, and semantic grounding. These failure modes largely do not overlap, which suggests that combining the two approaches is a tractable engineering problem rather than a research moonshot.
"TiPToP excels at geometric reasoning, long-horizon sequencing, and semantic grounding, but fails when grasps slip or meshes are poorly approximated; π₀.₅-DROID benefits from closed-loop reactivity but struggles with multi-step structure, tight constraints, and distractor-rich scenes." (Section VIII)
2. Contrarian Perspectives
More Training Data Does Not Buy You Task Generalization
The conventional wisdom in Physical AI is that VLAs trained on massive, diverse datasets will eventually generalize to arbitrary tasks. TiPToP directly challenges this: a system with zero robot training data outperforms a VLA trained on 350 hours of demonstrations on the majority of non-trivial tasks tested. The failure of π₀.₅-DROID on semantic tasks is particularly damning — it scored 0/5 on four of eight semantic scenes, including tasks where understanding "sort blocks by color" or "pick the largest toy" would seem basic.
"π₀.₅-DROID scores 0/5 on four of [the semantic tasks]. We attribute this performance to TiPToP's use of a large VLM to translate visual observations and natural language instructions into a symbolic goal 𝒢." (Section VII-B)
The implication: for tasks requiring semantic grounding or multi-step reasoning, scaling demonstration data may be the wrong lever. Structured planning over explicit symbolic goals appears to be a more sample-efficient path.
Closed-Loop Reactivity Is Overrated for Structured Tasks
The dominant architectural assumption in robotics is that real-world deployment requires closed-loop, reactive control — constant visual feedback to correct errors in real time. TiPToP executes entirely open-loop after a single perception step, and still outperforms a closed-loop VLA on most tasks. The key insight is that for tasks where geometric constraints can be pre-computed and trajectories can be tracked accurately, reactivity adds latency and complexity without improving outcomes.
"TiPToP does not monitor execution or replan based on execution-time observations (i.e., it is open-loop with respect to visual observations). This succeeds when the world is static and trajectories are tracked accurately." (Section III-B)
The caveat is real: open-loop execution fails on slippery objects and small targets. But the point stands: open-loop plan-then-execute is a legitimate architecture for structured environments, not just a research simplification.
TAMP Is Practically Deployable Without Years of Integration Work
TAMP has historically been dismissed as a lab technique — requiring precise object CAD models, hand-engineered domain knowledge, and weeks of hardware integration. TiPToP directly refutes this. By combining GPU-parallelized TAMP (cuTAMP) with foundation model perception, the team deploys TAMP on a new hardware platform in hours, with no pre-specified object geometries, using only RGB images as input.
"There has been substantial research demonstrating TAMP in the real world, but these systems lack generality and have generally relied on implementations that are tightly coupled to specific hardware, perception, and control stacks, making them difficult to access and build upon." (Section I)
TiPToP's counter-example directly challenges this: the system runs on DROID, UR5e, and Trossen WidowX platforms, was deployed by an external team not involved in development, and extended to a new skill (whiteboard wiping) in under a day.
3. Companies Identified
Physical Intelligence (π.ai)
- Description: AI robotics company building generalist robot policies
- Why relevant: π₀.₅-DROID is the primary benchmark competitor throughout the paper. The system is described as "a vision-language-action flow model" fine-tuned on 350 hours of DROID demonstrations
- Quote: "We evaluate TiPToP... and find it matches or outperforms π₀.₅-DROID, a vision-language-action (VLA) model fine-tuned on 350 hours of embodiment-specific demonstrations." (Abstract)
Google DeepMind (Gemini)
- Description: AI lab, developer of the Gemini family of multimodal models
- Why relevant: TiPToP's semantic branch uses Gemini Robotics-ER 1.5 as its VLM backbone for object detection, goal grounding, and bounding box localization. Its common-sense reasoning is explicitly cited as the reason TiPToP outperforms π₀.₅-DROID on semantic tasks
- Quote: "We query Gemini Robotics-ER 1.5, a VLM, once to jointly extract: (1) labels and 2D bounding boxes for objects in the scene, and (2) a symbolic goal 𝒢." (Section IV-B)
NVIDIA
- Description: GPU computing and robotics infrastructure company
- Why relevant: Two NVIDIA technologies are core infrastructure components. cuRobo (GPU-accelerated motion planner) handles collision-free trajectory generation. Isaac Sim is used as the simulation environment for 5 of 28 evaluation scenes. Evaluations run on NVIDIA L4, RTX 3080, and RTX 4090 GPUs
- Quote: "cuTAMP invokes cuRobo, a GPU-accelerated motion planner, to solve for the remaining trajectory parameters as collision-free, time-parameterized trajectories." (Section V)
Meta AI (SAM / SAM-2)
- Description: AI research lab, developer of Segment Anything Model
- Why relevant: SAM-2 is a core perception component, generating pixel-level segmentation masks for every detected object. Without it, TiPToP cannot extract per-object geometry for planning
- Quote: "For each detected bounding box, we use SAM-2 to generate a pixel-level segmentation mask from the left image." (Section IV-B)
Universal Robots
- Description: Collaborative robot arm manufacturer
- Why relevant: TiPToP was deployed on a UR5e arm, with the controller implemented via Universal Robots' RTDE interface. This validates cross-embodiment claims beyond the Franka ecosystem
- Quote: "We implement a joint-space trajectory tracking controller using the Universal Robots servoJ primitive via the RTDE interface." (Appendix D)
Stereolabs (ZED Mini)
- Description: Stereo camera manufacturer
- Why relevant: The ZED Mini is the primary stereo camera in the DROID hardware setup. Interestingly, the paper notes that FoundationStereo outperforms the ZED's own proprietary stereo matching on difficult surfaces
- Quote: "We found that FoundationStereo produces cleaner depth maps than the ZED camera's proprietary stereo matching, particularly on transparent, specular, and textureless surfaces." (Section IV-A)
Intel (RealSense)
- Description: Computing hardware company, manufacturer of RealSense depth cameras
- Why relevant: The RealSense D435 and D405 are used for UR5e and Trossen WidowX deployments respectively. The paper notes active IR stereo on RealSense produces noisier depth than RGB stereo on ZED Mini, particularly on reflective surfaces
- Quote: "This qualitatively resulted in noisier depth estimates than the DROID setup... active IR stereo struggles with such surfaces because the projected pattern does not reflect reliably." (Appendix D)
Trossen Robotics
- Description: Manufacturer of low-cost robot arms
- Why relevant: TiPToP was deployed on a Trossen WidowX AI arm in collaboration with an independent researcher, which demonstrates deployment on low-cost, non-research-grade hardware
- Quote: "TiPToP was also deployed on a Trossen WidowX AI arm with a wrist-mounted RealSense D405 camera in collaboration with an independent researcher." (Section VII-C)
4. People Identified
William Shen
- Lab/Institution: MIT CSAIL (equal first author)
- Why notable: Lead developer of cuTAMP, the GPU-parallelized TAMP system that is the planning backbone of TiPToP. Also a co-author on the cuTAMP paper (RSS 2025). Represents a new generation of researchers making classical planning methods computationally viable for real-world deployment
- Quote: "William adapted and improved the core cuTAMP system to be suitable (simpler to use, faster) for our purposes." (Author Contributions)
Nishanth Kumar
- Lab/Institution: MIT CSAIL (equal first author)
- Why notable: Led perception integration, specifically the Gemini and SAM-2 interfaces. Co-author on OWL-TAMP (arXiv 2024), which uses VLMs to infer TAMP constraints — a predecessor idea to TiPToP's semantic branch
- Quote: "Nishanth implemented the perception interface to Gemini and SAM." (Author Contributions)
Leslie Pack Kaelbling
- Lab/Institution: MIT CSAIL
- Why notable: One of the most influential researchers in robot task and motion planning and, with Lozano-Pérez, a developer of the hierarchical TAMP approach this work builds on. Her presence on this paper signals that TiPToP represents a serious attempt to make TAMP production-viable, not just a benchmark exercise
- Quote: "Leslie Pack Kaelbling and Tomás Lozano-Pérez... strongly encouraged that the code should be easy to install." (Author Contributions)
Tomás Lozano-Pérez
- Lab/Institution: MIT CSAIL
- Why notable: Co-developer of hierarchical TAMP with Kaelbling; decades of foundational work on robot manipulation planning. His involvement reinforces the paper's positioning as a bridge between classical planning theory and modern foundation model deployment
- Reference: Co-authored foundational TAMP papers cited throughout the paper, including "Hierarchical task and motion planning in the now" (ICRA 2011)
Dinesh Jayaraman
- Lab/Institution: University of Pennsylvania
- Why notable: Advised the external evaluation at Penn, a methodologically important role. The Penn team independently deployed TiPToP and ran a systematic comparison against π₀.₅-DROID, lending credibility to results that might otherwise be dismissed as internally biased
- Quote: "Dinesh Jayaraman advised the evaluations at the University of Pennsylvania and provided lab resources." (Author Contributions)
5. Operating Insights
Grasp Failure Is the Dominant Bottleneck — Not Planning, Not Perception
Across the 173 evaluation trials, the failure analysis is unambiguous: 56% of all failures (31/55) are grasp failures. Either M2T2 predicts high-confidence grasps that fail in execution, or the heuristic fallback sampler is used when M2T2 has no prediction. Planning failures (cuTAMP) account for only 9% of failures (5/55). This means teams building manipulation systems should prioritize grasp model quality and recovery mechanisms far ahead of planning sophistication.
For operators: if you're deploying a pick-and-place system and your success rate is below target, the problem is almost certainly in grasp execution, not in task planning. The second-biggest failure category — scene completion errors (24% of failures) from poor mesh approximation — is addressable with multi-view perception or learned shape completion.
"Grasping failures (31/55 failures) are the most common failure mode... The most direct improvement is to re-run perception and planning after each pick-and-place step, enabling recovery from failed grasps or unexpected object movement." (Sections VII-E and VIII)
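The improvement proposed in the quote above, re-running perception and planning after each pick-and-place step, can be contrasted with pure open-loop execution in a toy simulation. Everything here is a simplified assumption: `ToyWorld`, the scripted slip, and the one-line `plan` function stand in for real perception, cuTAMP, and the robot, and are not the paper's implementation.

```python
class ToyWorld:
    """Toy tabletop: tracks where each object is; some grasp attempts slip."""

    def __init__(self, slips):
        self.location = {"can": "table", "cube": "table"}
        self.slips = set(slips)  # indices of attempts at which the grasp slips
        self.attempts = 0

    def observe(self):
        # Stand-in for perception: returns a snapshot of object locations.
        return dict(self.location)

    def execute_step(self, step):
        # Stand-in for executing one pick-and-place; a slip leaves the object put.
        obj, target = step
        slipped = self.attempts in self.slips
        self.attempts += 1
        if not slipped:
            self.location[obj] = target


def plan(state, goal):
    """Return the pick-and-place steps still needed to reach the goal."""
    return [(obj, target) for obj, target in goal.items() if state[obj] != target]


def run_open_loop(goal, world):
    # Perceive once, plan once, execute every step without looking again.
    for step in plan(world.observe(), goal):
        world.execute_step(step)
    # The final observation is used only to score the outcome.
    return not plan(world.observe(), goal)


def run_with_replanning(goal, world, max_steps=10):
    # Re-perceive and re-plan before every step until nothing remains.
    for steps_taken in range(max_steps):
        remaining = plan(world.observe(), goal)
        if not remaining:
            return True, steps_taken
        world.execute_step(remaining[0])
    return not plan(world.observe(), goal), max_steps


goal = {"can": "mug", "cube": "bowl"}
print(run_open_loop(goal, ToyWorld(slips={0})))        # False
print(run_with_replanning(goal, ToyWorld(slips={0})))  # (True, 3)
```

With a single slip on the first grasp, the open-loop run leaves the can on the table, while the replanning run recovers at the cost of one extra step. The extra perception and planning calls are the price of that recovery, which is the trade-off against the speed advantage discussed earlier.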
Open-Loop Execution With a Custom Controller Can Beat Closed-Loop VLAs — But Only With Precise Trajectory Tracking
TiPToP's speed advantage over π₀.₅-DROID depends critically on accurate trajectory execution. The team found that existing open-source controllers including DROID's default Polymetis controller were insufficient and had to implement a custom joint impedance controller. Even this controller exhibits "typically up to 5mm position error" at high speeds. This is a practical warning for teams planning to deploy planning-based systems: the execution stack is not a solved problem, and controller quality directly determines whether geometric plans succeed in the real world.
"We implemented our own joint-space impedance controller for Franka arms, because existing open-source controllers, including DROID's default Polymetis controller, were unable to track timed trajectories sufficiently precisely." (Section VI)
For CTOs evaluating TAMP-based systems: don't assume the robot's built-in controller is adequate. Budget engineering time for controller tuning, and treat trajectory tracking precision as a first-class system requirement.
VLMs Are Now Good Enough to Be the Semantic Backbone of Manipulation Systems
The practical takeaway from the semantic task results is that large VLMs (specifically Gemini Robotics-ER 1.5) can handle the language-grounding step that VLA policies struggle with. TiPToP calls Gemini once per task to get object detections and a symbolic goal, and this single inference step is sufficient to enable tasks on which π₀.₅-DROID fails completely (0/5 on four semantic scenes). VLM errors still account for 6 of the 55 recorded failures, so grounding is not flawless, but the "what to do and to which object" problem is largely handled by frontier VLMs; the larger unsolved problem is "how to physically execute it reliably."
"We attribute this performance to TiPToP's use of a large VLM to translate visual observations and natural language instructions into a symbolic goal 𝒢. This explicit grounding step enables TiPToP to correctly identify task-relevant objects amid distractors and to interpret complex referring expressions." (Section VII-B)
6. Overlooked Insights
The External Evaluation Protocol Is Methodologically Significant — and Rarely Done
Most robotics papers are evaluated by their own developers, creating a systematic bias toward favorable conditions and task selection. TiPToP's authors sent their code to a team at the University of Pennsylvania that was not involved in development; that team independently deployed the system and ran 15 of the 28 evaluation scenes. That TiPToP still outperforms π₀.₅-DROID under this evaluation makes the result substantially more credible than a self-reported one.
"To validate TiPToP's accessibility and generalizability, we sent our code to an external evaluation team not involved in its development. This team independently deployed TiPToP on the DROID hardware platform and conducted a systematic comparison against π₀.₅-DROID." (Section I)
This evaluation design choice should become an industry standard for Physical AI system claims. Investors and operators should weight results from independently verified evaluations significantly higher than self-reported benchmarks. The fact that the MIT team proactively chose this methodology is itself a signal about confidence in the system.
The System Has a Structural Blind Spot: Single-Viewpoint Perception at Plan Time
TiPToP observes the scene once, at a fixed capture pose, and then executes entirely open-loop with no further perception. This means any object not visible from that single wrist-camera pose is simply absent from the plan. The paper acknowledges this as a fundamental limitation — but its full operational implications are understated. In real deployments (warehouse picking, kitchen manipulation, industrial assembly), partial observability from a single viewpoint is the rule, not the exception.
The paper proposes multi-view perception as a fix but does not implement it. More importantly, the scene completion errors that account for 24% of failures (13/55) are closely tied to single-viewpoint geometry: with only one view, the unseen side of each object must be guessed, and the convex hull completion can over- or under-approximate the true shape, filling in concavities or missing hidden extent.
"Single-viewpoint perception. All task-relevant objects must be at least partially visible from a single wrist-camera pose. This also limits mesh quality: with only one viewpoint, convex hull completion can over or under-approximate object geometry... Multi-view perception, via active camera movement before planning or additional static cameras, would reduce occlusions and improve shape estimates." (Section VIII)
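The over-approximation half of that limitation can be seen in a small 2D analogue: the convex hull of a non-convex outline always covers at least as much area as the outline itself. The sketch below uses a standard monotone-chain hull and the shoelace area formula on a made-up L-shaped footprint. It illustrates the geometric effect only, and is not the 3D mesh completion used in TiPToP.

```python
def cross(o, a, b):
    """z-component of (a - o) x (b - o); positive means a left turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices counter-clockwise."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other.
    return lower[:-1] + upper[:-1]


def polygon_area(vertices):
    """Shoelace formula for a simple polygon given in boundary order."""
    n = len(vertices)
    twice_area = sum(
        vertices[i][0] * vertices[(i + 1) % n][1]
        - vertices[(i + 1) % n][0] * vertices[i][1]
        for i in range(n)
    )
    return abs(twice_area) / 2


# Hypothetical L-shaped object footprint (non-convex), in arbitrary units.
l_shape = [(0, 0), (2, 0), (2, 1), (1, 1), (1, 2), (0, 2)]
hull = convex_hull(l_shape)

print(polygon_area(l_shape))  # 3.0
print(polygon_area(hull))     # 3.5
```

Here the hull claims about 17% more area than the true footprint, all of it in the concave notch. For a planner, that extra volume is space it treats as occupied when checking collisions and placements.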
For teams evaluating this system or building on it: the tabletop benchmark conditions (objects arranged in clear view, flat surfaces, controlled lighting) are maximally favorable to single-viewpoint perception. Real-world deployment will require either active camera exploration before planning or additional fixed cameras — neither of which is currently implemented.