CoStream: Composing Simple Behaviors for Generalizable Complex Manipulation
- 01 Composition Over Monoliths for Manipulation Policies
- 02 Three Orthogonal Behavior Channels on a Shared SE(3) Interface
- 03 Real-World Validation on Contact-Rich Assembly
1. Key Themes
Composition Over Monoliths for Manipulation Policies
The central contribution is a framework that rejects both dominant paradigms in robotic manipulation — rigid classical pipelines and monolithic end-to-end learned policies — in favor of composing simple, independent behaviors on a shared mathematical interface. The paper argues that "complex manipulation capabilities can emerge naturally from the composition of simple, independent behaviors" rather than requiring either a "rigid pipeline or monolithic whole." This is architecturally significant because it means capabilities can be added, swapped, or reconfigured without retraining the entire system.
Three Orthogonal Behavior Channels on a Shared SE(3) Interface
CoStream decomposes manipulation into three composable behaviors, each drawing on a different sensing modality and AI capability: (1) a semantic behavior that uses foundation models to extract spatial constraints (e.g., where does this part need to go?), (2) a predictive behavior that forecasts trajectories by tracking keypoints in "imagined videos," which implies a video generation model producing a visual plan, and (3) a reactive behavior providing "high-frequency tactile and force corrections." The outputs compose by "right-multiplication into a single pose command at each control step, executed by a compliant controller." Because each output is an SE(3) transform, right-multiplication applies each successive behavior's output in the frame established by the ones before it. The division of labor is clean: semantic reasoning sets the goal, video prediction generates the trajectory, and tactile feedback closes the loop.
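The right-multiplication step can be sketched with plain 4x4 homogeneous transforms. This is a minimal illustration, not the paper's implementation: the helper names, the composition order (semantic, then predictive, then reactive), and all numeric values are assumptions chosen for the example.

```python
import math

def identity():
    """4x4 identity: the 'do nothing' element of SE(3)."""
    return [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]

def matmul(a, b):
    """Product a @ b of two 4x4 homogeneous transforms."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def translation(x, y, z):
    """Pure translation as a 4x4 transform."""
    t = identity()
    t[0][3], t[1][3], t[2][3] = x, y, z
    return t

def rot_z(theta):
    """Rotation by theta radians about the z axis."""
    c, s = math.cos(theta), math.sin(theta)
    t = identity()
    t[0][0], t[0][1] = c, -s
    t[1][0], t[1][1] = s, c
    return t

def compose(*transforms):
    """Right-multiply each transform onto the running pose, in order."""
    pose = identity()
    for t in transforms:
        pose = matmul(pose, t)
    return pose

# Hypothetical per-step outputs (values invented for illustration).
semantic = matmul(translation(0.40, 0.00, 0.20), rot_z(math.pi / 2))  # goal frame
predictive = translation(0.05, 0.00, 0.00)   # next step of the plan, in the goal frame
reactive = translation(0.00, 0.00, -0.002)   # small force-driven nudge

command = compose(semantic, predictive, reactive)
position = [command[i][3] for i in range(3)]
```

Because the predictive step is expressed in the goal frame, which is rotated 90 degrees about z, its 5 cm move along local x shows up as 5 cm along world y in `position`. That is the practical effect of right-multiplication: later behaviors act relative to the frame set by earlier ones.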
Real-World Validation on Contact-Rich Assembly
The paper demonstrates CoStream on "8 real-world tasks spanning everyday manipulation and precision assembly," with the "strongest gains in contact-rich assembly and object transfer." The GPU-into-PCIe-slot task is the flagship example — a task requiring millimeter-level precision that is representative of real manufacturing and electronics assembly workflows. The system also demonstrates "robust recovery from manual perturbations during execution," which is critical for any deployment where the environment is not perfectly controlled.
2. Contrarian Perspectives
End-to-End Policies Are the Wrong Abstraction for Precision Tasks
The prevailing trend in robotics research and among startups is to train ever-larger end-to-end policies (diffusion policies, VLA models) that map pixels directly to actions. CoStream pushes back, arguing that monolithic policies "lack high precision on complex, out-of-distribution tasks unless retrained with new data." The implication is that for contact-rich assembly, the kind of task that matters most for manufacturing automation, the end-to-end paradigm struggles with precision and OOD generalization at the same time. Investors betting purely on scaling end-to-end policies for industrial manipulation should take note.
Classical Pipelines' Brittleness Is an Architecture Problem, Not a Fundamental Limitation
The paper doesn't dismiss classical pipelines — it argues their failure mode (brittle, task-specific interfaces requiring "costly pipeline redesigns") stems from the assumption that capabilities "must be deployed as a rigid pipeline... rather than being freely decomposed and recomposed." By keeping the modularity of classical approaches but replacing hardcoded interfaces with a composable SE(3) interface, CoStream claims you get the precision of classical methods with the adaptability of learned ones. This challenges companies building either pure-learning or pure-classical stacks.
Video Generation Models Are Manipulation Planners
Using "imagined videos" with keypoint tracking as a predictive behavior is a contrarian use of video generation models. Most of the industry treats video generation as a content creation tool. CoStream treats it as a trajectory forecaster: the generated video is a plan, and keypoints tracked through it become waypoints. This suggests video foundation models have latent utility as manipulation planners, not just as simulators or data generators.
3. Companies Identified
No specific companies are referenced in the provided text (abstract only). The paper uses generic descriptions of paradigms (classical pipelines, end-to-end policies) rather than naming commercial systems.
4. People Identified
Haonan Chen
Institution: Not specified in the provided text (paper sourced from arXiv)
Why notable: Lead author of CoStream, working on composable manipulation frameworks.
Wenlong Huang
Institution: Not specified in the provided text (paper sourced from arXiv)
Why notable: Previously associated with foundational work on language-conditioned manipulation (e.g., VoxPoser, Inner Monologue at Google/Stanford). His involvement signals continuity in the line of research using foundation models for manipulation planning.
Yilun Du
Institution: Not specified in the provided text (paper sourced from arXiv)
Why notable: Known for work on compositional generation and multi-objective reasoning with diffusion models at MIT. His presence on this paper connects the composition-of-behaviors idea to his broader thesis that complex AI capabilities emerge from composing simpler, independently trained modules.
Edward H. Adelson
Institution: Not specified in the provided text (paper sourced from arXiv)
Why notable: Co-inventor of the GelSight tactile sensor and a pioneer in computer vision at MIT. His involvement suggests the tactile/reactive behavior channel may leverage GelSight-class sensing, connecting this work to the tactile sensing hardware ecosystem.
Jiajun Wu
Institution: Not specified in the provided text (paper sourced from arXiv)
Why notable: Prominent researcher at Stanford working on physical reasoning, simulation, and embodied AI. Brings expertise in how AI systems reason about the structure of the physical world.
5. Operating Insights
Design for Composition, Not Integration
If you are building a manipulation stack, the CoStream architecture suggests a specific design pattern: define a shared pose interface (SE(3)) and build independent behavior modules that each output pose corrections. This means your semantic planner, your trajectory forecaster, and your force/torque controller can each be improved, swapped, or scaled independently. The "right-multiplication" composition is mathematically simple and operationally useful: because modules interact only through the shared pose interface, upgrading one should not, in principle, require retraining the others. For a CTO, this reduces the blast radius of any single component failure or upgrade.
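One way to picture this design pattern is a small registry of behaviors that all share one signature and are composed in a fixed order. The sketch below is an assumption-laden illustration: the `Behavior` signature, the `Composer` class, the observation keys, and the toy behaviors are all hypothetical and do not come from the paper.

```python
from typing import Callable, Dict, List

Pose = List[List[float]]           # 4x4 homogeneous transform, row-major
Behavior = Callable[[dict], Pose]  # hypothetical interface: observation -> SE(3) output

def matmul(a: Pose, b: Pose) -> Pose:
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def translation(x: float, y: float, z: float) -> Pose:
    t = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
    t[0][3], t[1][3], t[2][3] = x, y, z
    return t

class Composer:
    """Holds named behaviors and right-multiplies their outputs in order."""

    def __init__(self, behaviors: Dict[str, Behavior]):
        self.behaviors = dict(behaviors)  # insertion order = composition order

    def swap(self, name: str, behavior: Behavior) -> None:
        """Replace one module; the others are untouched."""
        if name not in self.behaviors:
            raise KeyError(name)
        self.behaviors[name] = behavior   # keeps the slot's place in the order

    def command(self, obs: dict) -> Pose:
        pose = translation(0.0, 0.0, 0.0)  # identity
        for behavior in self.behaviors.values():
            pose = matmul(pose, behavior(obs))
        return pose

# Toy stand-ins for the three channels (all values invented).
def semantic_v1(obs: dict) -> Pose:
    return translation(*obs["goal"])

def semantic_v2(obs: dict) -> Pose:
    x, y, z = obs["goal"]
    return translation(x, y, z + 0.05)    # e.g. an upgraded model adds an approach offset

def predictive(obs: dict) -> Pose:
    return translation(0.0, 0.0, 0.01)

def reactive(obs: dict) -> Pose:
    return translation(0.0, 0.0, -0.001 * obs["force_z"])

stack = Composer({"semantic": semantic_v1,
                  "predictive": predictive,
                  "reactive": reactive})
obs = {"goal": (0.30, 0.10, 0.20), "force_z": 2.0}

z_before = stack.command(obs)[2][3]
stack.swap("semantic", semantic_v2)       # upgrade one module only
z_after = stack.command(obs)[2][3]
```

In this toy setup the commanded height changes by exactly the 5 cm offset introduced by `semantic_v2`, while the predictive and reactive contributions are unchanged. Whether real modules stay this cleanly decoupled depends on how much each one implicitly relies on the others' outputs, which the provided text does not address.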
Tactile Feedback Is Non-Optional for Contact-Rich Assembly
The paper's "strongest gains in contact-rich assembly" are reported for a system that includes a "reactive behavior providing high-frequency tactile and force corrections." If your robot is doing precision insertion, snap-fits, or threading tasks, vision-only policies are likely to plateau. The architecture explicitly separates high-frequency reactive control (tactile/force) from the semantic and predictive behaviors, which presumably update more slowly; this temporal decomposition is a practical blueprint for structuring control loops in real systems.
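The temporal decomposition can be sketched as a multi-rate loop: the slow output is cached and refreshed occasionally, while the fast correction is recomputed every tick. This is a simplified illustration under stated assumptions. The tick counts, update period, and both toy behaviors are hypothetical, and poses are reduced to a single offset along the insertion axis, where composing pure translations reduces to addition.

```python
def run_loop(n_ticks, slow_period, slow_behavior, fast_behavior):
    """Run n_ticks control steps; refresh the slow output every slow_period ticks."""
    cached_slow = 0.0
    commands = []
    for tick in range(n_ticks):
        if tick % slow_period == 0:
            cached_slow = slow_behavior(tick)   # expensive: semantic/predictive update
        # The fast correction is recomputed every tick and composed with the
        # cached slow target (addition, since both are 1-D translations).
        commands.append(cached_slow + fast_behavior(tick))
    return commands

calls = {"slow": 0, "fast": 0}

def slow_target(tick):
    """Toy planner: each replan asks for 1 mm more insertion depth."""
    calls["slow"] += 1
    return -0.001 * (tick // 20 + 1)

def fast_correction(tick):
    """Toy stand-in for a force-driven correction: back off 0.2 mm on odd ticks."""
    calls["fast"] += 1
    return 0.0002 if tick % 2 else 0.0

commands = run_loop(n_ticks=100, slow_period=20,
                    slow_behavior=slow_target, fast_behavior=fast_correction)
```

With these numbers the slow behavior runs 5 times and the fast one 100 times over the same window. The design choice being illustrated is that the reactive channel never waits on the slower modules; it always composes with their most recent output.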
6. Overlooked Insights
The "Imagined Video" as a Manipulation Primitive
The predictive behavior "forecasting trajectories by tracking keypoints in imagined videos" is buried in the architecture description but has outsized implications. This means CoStream is using a video generation/diffusion model to produce a visual plan of the manipulation, then extracting keypoints from that generated video to form a trajectory. This is distinct from using video models to generate training data — here, the video model is an online planner running at inference time. If this works reliably, it means any advance in video generation models directly improves manipulation planning, creating a free-riding effect on the massive investment flowing into video AI.
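To make the keypoints-to-trajectory step concrete, here is a minimal sketch that fits a rigid motion to tracked keypoints, frame by frame. It is a planar simplification and not the paper's method: it assumes the keypoints are already available as metric 2D coordinates in a common plane (the provided text does not say how pixel tracks are lifted to 3D), and the function names are hypothetical. The fit is the standard closed-form least-squares rigid alignment in 2D.

```python
import math

def fit_planar_rigid(ref, cur):
    """Least-squares rotation angle and translation mapping ref points onto cur.

    ref, cur: equal-length lists of (x, y) with at least two distinct points.
    Returns (theta, tx, ty) such that cur ~= R(theta) @ ref + (tx, ty).
    """
    n = len(ref)
    pcx = sum(p[0] for p in ref) / n
    pcy = sum(p[1] for p in ref) / n
    qcx = sum(q[0] for q in cur) / n
    qcy = sum(q[1] for q in cur) / n
    s_dot = s_cross = 0.0
    for (px, py), (qx, qy) in zip(ref, cur):
        ax, ay = px - pcx, py - pcy          # centered reference point
        bx, by = qx - qcx, qy - qcy          # centered tracked point
        s_dot += ax * bx + ay * by
        s_cross += ax * by - ay * bx
    theta = math.atan2(s_cross, s_dot)
    c, s = math.cos(theta), math.sin(theta)
    tx = qcx - (c * pcx - s * pcy)
    ty = qcy - (s * pcx + c * pcy)
    return theta, tx, ty

def track_to_waypoints(frames):
    """One (theta, tx, ty) waypoint per later frame, relative to the first frame."""
    ref = frames[0]
    return [fit_planar_rigid(ref, frame) for frame in frames[1:]]

def apply_motion(theta, tx, ty, points):
    """Helper used only to synthesize 'tracked' keypoints for the demo."""
    c, s = math.cos(theta), math.sin(theta)
    return [(c * x - s * y + tx, s * x + c * y + ty) for x, y in points]

# Synthetic keypoint tracks standing in for tracks from a generated video.
keypoints0 = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1)]
frames = [keypoints0,
          apply_motion(0.1, 0.01, 0.00, keypoints0),
          apply_motion(0.3, 0.05, -0.02, keypoints0)]

waypoints = track_to_waypoints(frames)
```

On this noise-free synthetic input the fit recovers the motions used to generate the frames. Real tracks from a generated video would be noisy and possibly physically inconsistent, so outlier rejection and a 3D fit would likely be needed in practice.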
Generalization Without Retraining Is the Real Benchmark
The paper frames its contribution as achieving "out-of-the-box generalization to new tasks" without the "costly pipeline redesigns" of classical methods or the retraining required by end-to-end policies. For investors evaluating manipulation startups, the key question to ask is: does this system require new data or new engineering for each new task? CoStream's claim is that composition itself provides generalization — new tasks are new compositions of existing behaviors, not new training runs. If this holds at scale, it fundamentally changes the unit economics of robotic manipulation deployment.