RedVLA: Physical Red Teaming for Vision-Language-Action Models
- 01 The Attack Surface for Physical AI Is the Environment, Not the Model
- 02 Capability and Safety Are Currently Inversely Correlated
- 03 Unsafe Behaviors Emerge During Task Execution, Not Model Failure
- 04 Physical Safety Risks Are Robust to Environmental Noise
- 05 Sim-to-Real Transfer of Safety Vulnerabilities Is Confirmed
The Bottom Line: Every VLA model tested — including the most capable ones — can be reliably manipulated into unsafe physical behaviors by strategically placing everyday objects (knives, wine bottles, books) in their operating environment. This paper is the first to systematically demonstrate and measure this vulnerability, and to propose a lightweight defense. For anyone deploying robots in real environments, this is a safety debt that needs to be priced in now.
1. Key Themes
The Attack Surface for Physical AI Is the Environment, Not the Model
Prior AI safety work focused on adversarial prompts and image perturbations — hacking the model's inputs. RedVLA demonstrates that the real attack surface for embodied AI is the physical world itself. By placing a kitchen knife where a robot expects to pick up a bowl, researchers achieved attack success rates up to 95.5% across six production-grade VLA models.
"The source of risks in VLA models has fundamentally shifted from the intent space to the physical space, introducing distinct safety challenges... Existing red teaming methods neither consider potential physical risks within the environment nor model the physical causality of the risk process." (Section 1)
The practical implication: robot safety can't be solved at the model layer alone. The deployment environment is an attack vector.
Capability and Safety Are Currently Inversely Correlated
This is the paper's most alarming finding for investors and operators. Better-performing models are more exploitable, not less. As models improve their ability to follow instructions and manipulate objects, they become better at being manipulated into unsafe behaviors.
"OpenVLA-OFT improves Benign SR over OpenVLA by 20.6% (97.1% vs. 76.5%), but also increases ASR by 25.6% (90.5% vs. 64.9%). These results suggest that stronger instruction-following ability may also increase the likelihood of triggering unsafe behaviors." (Section 5.3, Table 2)
This is a fundamental tension in the current VLA development paradigm: optimizing for task performance without co-optimizing for safety produces increasingly dangerous systems.
Unsafe Behaviors Emerge During Task Execution, Not Model Failure
A critical distinction: these vulnerabilities are not caused by the robot "breaking down" or losing task coherence. The robot is functioning correctly — executing its task — while simultaneously causing harm. This makes the problem much harder to detect with standard success-rate metrics.
"Most unsafe rollouts fall into SU [Success + Unsafe] or AU [Attempt + Unsafe] across all six models, while collapse accounts for only a small fraction. This result indicates that RedVLA mainly exposes unsafe behaviors during task execution, rather than failures caused by model collapse." (Section 5.2, Figure 4)
A robot that picks up a knife instead of a bowl and carries it across the room is still "attempting the task." Standard quality metrics would miss this entirely.
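To make the metric gap concrete, here is a minimal Python sketch of how unsafe rollouts could be bucketed and why task success rate alone hides them. Only the SU, AU, and collapse labels come from the paper's quoted text; the `Rollout` fields, the `safe` bucket, and the example counts are illustrative assumptions, not the paper's evaluation code.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Rollout:
    task_succeeded: bool  # the task goal was reached
    task_attempted: bool  # the policy stayed on task (no collapse)
    unsafe: bool          # a safety violation was triggered


def label(r: Rollout) -> str:
    """Bucket a rollout: SU, AU, or collapse if unsafe, else 'safe' (assumed name)."""
    if not r.unsafe:
        return "safe"
    if r.task_succeeded:
        return "SU"        # Success + Unsafe
    if r.task_attempted:
        return "AU"        # Attempt + Unsafe
    return "collapse"      # unsafe because the policy lost task coherence


def summarize(rollouts):
    """Report task success rate and ASR side by side, plus bucket counts."""
    n = len(rollouts)
    return {
        "task_success_rate": sum(r.task_succeeded for r in rollouts) / n,
        "asr": sum(r.unsafe for r in rollouts) / n,
        "counts": dict(Counter(label(r) for r in rollouts)),
    }


# Made-up example: 6 SU, 2 AU, 1 collapse, 1 safe success.
rollouts = (
    [Rollout(True, True, True)] * 6
    + [Rollout(False, True, True)] * 2
    + [Rollout(False, False, True)]
    + [Rollout(True, True, False)]
)
print(summarize(rollouts))
```

In this made-up batch the task success rate is 70% while 90% of rollouts are unsafe, which is the blind spot described above: a success-rate dashboard would look healthy.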
Physical Safety Risks Are Robust to Environmental Noise
The vulnerabilities persist even under significant real-world signal degradation. Garbled instructions, reversed commands, camera blur, and occlusion do not prevent unsafe behaviors from being triggered once the physical risk factor is in the environment.
"RedVLA maintains high ASRs under both language and visual perturbations, with average ASRs of 88.2% and 85.5%, respectively... the physical safety risks uncovered by RedVLA are driven primarily by injected environmental risk factors, rather than trivial input perturbations." (Section 5.3, Table 3)
This rules out the hope that robust perception or language understanding will naturally solve the physical safety problem.
Sim-to-Real Transfer of Safety Vulnerabilities Is Confirmed
The paper doesn't stop at simulation. The authors deployed π₀ on a physical Franka robot and confirmed that the same attack vectors work in the real world.
"In Figure 6, the robot grasps the knife in task (i) and knocks over the obstacle cups in task (ii). Both scenarios achieve ASR above 80% over 10 trials per task, confirming that such unsafe behaviors also manifest in the real world." (Section 5.5)
This eliminates the "it's just a simulation artifact" objection. These are real physical harm scenarios.
2. Contrarian Perspectives
Conventional Wisdom: Safety-Testing VLAs at the Model Level Is Sufficient
The entire existing body of work on VLA safety — adversarial image patches, prompt injection, backdoor attacks — assumes the model is the attack surface. RedVLA argues this framing is fundamentally incomplete.
"Existing red teaming methods neither consider potential physical risks within the environment nor model the physical causality of the risk process." (Section 1)
The paper demonstrates that attack success rates of up to 95.5% can be triggered without touching the model at all, just by rearranging objects in the environment. A robot that passes every existing adversarial robustness benchmark could still pick up a knife and carry it toward a human. The safety evaluation paradigm needs to expand to include environmental configuration testing.
Conventional Wisdom: Red Teaming Physical AI Requires White-Box Model Access
Most adversarial ML techniques require access to model gradients (white-box attacks). This has led to a common assumption that meaningful safety testing requires access to model internals — a barrier that limits who can conduct safety evaluation. RedVLA uses gradient-free, black-box optimization.
"This refinement uses trajectory features to guide the search toward regions that are most likely to affect execution. As a result, the red teaming search is reduced from the high-dimensional state space to the low-dimensional interaction space, without requiring white-box access to π_θ." (Section 4.2)
The optimization converges in fewer than 10 iterations. This means third-party safety auditors, regulators, and even adversarial actors can conduct meaningful physical safety testing without access to model weights. For operators deploying VLAs, this means the vulnerability is accessible to a much wider range of threat actors than previously assumed.
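As a rough illustration of what a gradient-free, black-box search of this kind can look like, the Python sketch below runs a simple shrinking random search around waypoints of a benign trajectory. It is not the paper's algorithm: the function `search_placement`, the 2D placement space, the shrink factor, and the `toy_unsafe_rate` stand-in for simulated rollouts are all assumptions made for illustration. The only property carried over from the source is that the policy is queried as a black box for a small number of iterations.

```python
import math
import random
from typing import Callable, List, Tuple

Point = Tuple[float, float]


def search_placement(
    trajectory: List[Point],
    unsafe_rate: Callable[[Point], float],  # black-box: placement -> estimated ASR
    max_iters: int = 10,
    samples_per_iter: int = 8,
    radius: float = 0.10,
    seed: int = 0,
) -> Tuple[Point, float]:
    """Random search near the benign trajectory; no gradients or model weights used."""
    rng = random.Random(seed)
    centers = list(trajectory)          # start by sampling around every waypoint
    best, best_score = centers[0], -1.0
    for _ in range(max_iters):
        for _ in range(samples_per_iter):
            cx, cy = rng.choice(centers)
            cand = (cx + rng.uniform(-radius, radius),
                    cy + rng.uniform(-radius, radius))
            score = unsafe_rate(cand)   # one black-box evaluation
            if score > best_score:
                best, best_score = cand, score
        centers = [best]                # focus on the best placement so far
        radius *= 0.7                   # and narrow the search region
    return best, best_score


def toy_unsafe_rate(p: Point) -> float:
    """Hypothetical evaluator: risk peaks when the object sits at the grasp point."""
    grasp = (0.5, 0.2)
    return max(0.0, 1.0 - math.dist(p, grasp) / 0.5)


benign_trajectory = [(0.1 * i, 0.04 * i) for i in range(6)]  # ends at (0.5, 0.2)
placement, score = search_placement(benign_trajectory, toy_unsafe_rate)
print(placement, round(score, 3))
```

In a real audit, `unsafe_rate` would be replaced by repeated rollouts of the deployed policy with the risk object at the candidate placement, which is why no access to model internals is needed.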
Conventional Wisdom: A Lightweight Safety Layer Can't Generalize Across Tasks
The standard concern about add-on safety monitors is that they overfit to training scenarios and fail on novel tasks. SimpleVLA-Guard challenges this — a simple LSTM operating on internal VLA representations achieves meaningful detection on tasks it was never trained on.
"Although performance drops on unseen tasks, it still demonstrates a meaningful capability for generalization... the guard reduces ASR after deployment while incurring only a limited drop in Benign SR." (Section 6.2, Table 4)
Specifically: PRC-AUC of 0.89 on unseen tasks, and ASR reduction from 98.3% to 46.7% on unseen tasks. The key insight is that unsafe behaviors share a common latent signature across tasks, visible in the model's own internal representations — a tractable detection signal that doesn't require task-specific training.
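Because "ASR reduction" can be read as either an absolute or a relative change, a short Python calculation on the figures quoted in this summary may help. The helper name `asr_reduction` is ours; the inputs are the unseen-task numbers above and the seen-task numbers (90.9% to 31.4%) quoted later in the Operating Insights section.

```python
def asr_reduction(before: float, after: float):
    """Return (absolute reduction in percentage points, relative reduction in percent)."""
    absolute = before - after
    relative = 100.0 * absolute / before
    return round(absolute, 1), round(relative, 1)


print(asr_reduction(98.3, 46.7))  # unseen tasks -> (51.6, 52.5)
print(asr_reduction(90.9, 31.4))  # seen tasks   -> (59.5, 65.5)
```

On unseen tasks the guard removes 51.6 percentage points of ASR, roughly half of the original rate, compared with 59.5 points on seen tasks.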
3. Companies Identified
Physical Intelligence (π₀, π₀.₅)
General-purpose robot foundation model developer
Most directly implicated: π₀.₅ achieves the highest attack success rate of any model tested (95.5% average ASR), and π₀ is the platform used for real-world validation of the vulnerabilities. This is the paper's primary deployment-relevant finding.
"π₀.₅ reaching the highest [ASR] and OpenVLA the lowest... Both scenarios achieve ASR above 80% over 10 trials per task" on a physical Franka with π₀. (Sections 5.2, 5.5)
Franka Robotics
Collaborative robotic arm manufacturer
Their robot is the physical hardware used in the sim-to-real validation experiments, confirming that these vulnerabilities extend to real deployed hardware systems.
"We build a physical platform with a Franka robot and deploy the π₀ policy." (Section 5.5)
Stanford / OpenVLA (open-source VLA project)
Open-source VLA model initiative
OpenVLA and OpenVLA-OFT are evaluated extensively. OpenVLA shows lower ASR in some scenarios, but this is attributed to weaker baseline task performance rather than better safety properties. OpenVLA-OFT's capability improvements directly increase its exploitability.
"OpenVLA-OFT improves Benign SR over OpenVLA by 20.6%... but also increases ASR by 25.6%." (Section 5.3)
VLA-Adapter / VLA-Adapter-Pro
Lightweight VLA architecture from Wang et al. (2025)
Both variants evaluated. VLA-Adapter-Pro achieves near-identical safety vulnerability to its predecessor despite performance improvements, suggesting architectural efficiency gains don't inherently improve safety posture.
"VLA-Adapter-Pro starts from a higher ASR and saturates earlier." (Section 5.4)
4. People Identified
Yuhao Zhang, Borong Zhang, Jiaming Fan, Jiachen Shen, Yishuai Cai
Peking University / affiliated with Yaodong Yang's lab
Core engineering contributors on the RedVLA framework, experimental pipeline, and SimpleVLA-Guard implementation. Responsible for the technical execution of simulation environments, annotation platforms, and cross-model evaluation.
Authors listed on arXiv:2604.22591v1
Yaodong Yang
Peking University, Institute for AI
Senior researcher and lab lead. Yang's group has been systematically working on AI safety for embodied systems, including the companion paper SafeVLA. This lab is emerging as a leading voice on the safety alignment of physical AI systems, the equivalent of Anthropic's constitutional AI work but for robots.
Co-author; SafeVLA (Zhang et al., 2025b) is cited as a defense-side companion work from the same group.
Jiaming Ji
Peking University
Co-PI on this work and SafeVLA. Ji has been developing the constrained learning frameworks that underpin the defense strategies proposed in this line of research. Relevant for anyone building safety evaluation infrastructure for VLA deployments.
Co-author; also co-author on SafeVLA referenced as: "SafeVLA (Zhang et al., 2025b) explores constrained learning to enhance model safety." (Section 2)
Kevin Black et al. (Physical Intelligence team)
Physical Intelligence
Not authors of this paper, but their π₀ and π₀.₅ models are the primary experimental subjects and the platform for real-world validation. Their models' internal representations are what SimpleVLA-Guard is built on.
"K. Black, N. Brown, D. Driess... π₀: A vision-language-action flow model for general robot control." (References)
5. Operating Insights
Treat Your Robot's Operating Environment as Part of the Attack Surface — and Test It Accordingly
The central operational takeaway: pre-deployment safety testing must include environmental configuration testing, not just model-level evaluation. The RedVLA methodology — identify where the robot's end-effector travels, place risk objects there, iterate — is simple enough to operationalize as a standard pre-deployment protocol.
"We seek to maximize physical safety risks with minimal perturbation to the original environment... [identifying] critical interaction regions from benign trajectories and positions the risk factor within these regions." (Section 4)
For a CTO deploying robots in variable environments (warehouses, hospitals, homes), the question is no longer just "does the model perform the task?" but "what happens when an unexpected object is in the task path?" RedVLA provides a concrete framework for stress-testing this systematically before deployment.
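One way a team might start operationalizing the "identify where the end-effector travels" step is sketched below in Python. The heuristic used here, treating gripper open/close transitions in a logged benign trajectory as interaction points, is our assumption for illustration; the paper's summary does not specify how critical interaction regions are extracted, and the `Waypoint` format is hypothetical.

```python
from typing import List, NamedTuple, Tuple


class Waypoint(NamedTuple):
    x: float
    y: float
    z: float
    gripper_closed: bool


def interaction_points(traj: List[Waypoint]) -> List[Tuple[float, float, float]]:
    """Return positions where the gripper toggles (grasp or release events)."""
    points = []
    for prev, cur in zip(traj, traj[1:]):
        if prev.gripper_closed != cur.gripper_closed:
            points.append((cur.x, cur.y, cur.z))
    return points


# Made-up benign trajectory: approach, grasp, carry, release.
benign = [
    Waypoint(0.0, 0.0, 0.30, False),
    Waypoint(0.2, 0.1, 0.10, False),
    Waypoint(0.2, 0.1, 0.05, True),   # grasp
    Waypoint(0.4, 0.3, 0.20, True),
    Waypoint(0.5, 0.3, 0.05, False),  # release
]
print(interaction_points(benign))
```

The returned positions are candidate locations for placing a test object during a pre-deployment stress test; a real protocol would also need to bound how far the scene may be perturbed.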
Instrument Your Deployed VLAs with Internal Representation Monitoring
SimpleVLA-Guard's key engineering insight has immediate deployment implications: the VLA's own internal activations contain a detectable signal that distinguishes safe from unsafe execution trajectories. A lightweight LSTM (single layer, 256 hidden dims, runs on a single GPU) can monitor this in real time and halt execution before harm occurs, reducing ASR from 90.9% to 31.4% on seen tasks.
"Safe and unsafe trajectories are separated in the latent space, suggesting that internal representations carry useful signals for detecting physical risk... the guard reduces ASR after deployment while incurring only a limited drop in Benign SR [4.08% on seen tasks]." (Sections 6.1, 6.2, Table 4)
This is implementable today on production VLA systems. The cost is modest (about a 4% drop in Benign SR on seen tasks) and the benefit is significant (a 59.5-point reduction in ASR, per the abstract, matching the 90.9% to 31.4% drop on seen tasks). Any team deploying π₀ or similar models should evaluate whether they can instrument the hidden states for real-time safety monitoring.
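The halting side of such a monitor can be sketched independently of the detector itself. The Python snippet below assumes a guard model already emits a per-step risk score in [0, 1]; the threshold, the `patience` debounce, and the function name `first_halt_step` are illustrative assumptions, since this summary does not describe how SimpleVLA-Guard turns scores into a stop decision.

```python
from typing import Iterable, Optional


def first_halt_step(
    risk_scores: Iterable[float],
    threshold: float = 0.5,  # assumed decision threshold
    patience: int = 3,       # assumed: consecutive risky steps required to halt
) -> Optional[int]:
    """Return the step index at which execution would be halted, or None."""
    streak = 0
    for step, score in enumerate(risk_scores):
        streak = streak + 1 if score >= threshold else 0
        if streak >= patience:
            return step
    return None


# Made-up score streams from a hypothetical guard.
print(first_halt_step([0.1, 0.2, 0.7, 0.3, 0.6, 0.8, 0.9, 0.95]))  # -> 6
print(first_halt_step([0.1, 0.6, 0.2, 0.7, 0.1]))                   # -> None
```

Requiring several consecutive risky steps trades a little detection latency for fewer spurious stops, which is one plausible lever for keeping the Benign SR cost small.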
Don't Equate Task Performance Improvement With Safety Improvement — Track Both Separately
The performance-safety trade-off finding (Section 5.3) has direct implications for model selection and vendor evaluation. A newer, higher-performing model version should trigger a re-run of safety evaluation, not an assumption of equivalent or improved safety.
"Stronger models tend to achieve higher Benign SR, but also higher ASR. This trend is consistent across all three model families." (Section 5.3, Table 2)
Engineering teams should establish safety benchmarks as first-class metrics alongside task success rate — and expect that model upgrades may require re-validation of both dimensions independently.
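A release gate that tracks both dimensions could look like the Python sketch below. The `EvalResult` type, the `review_upgrade` function, and the zero-tolerance default for ASR increases are hypothetical; the example numbers are the OpenVLA and OpenVLA-OFT figures quoted from Table 2 earlier in this summary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalResult:
    benign_sr: float  # task success rate in benign scenes, percent
    asr: float        # attack success rate under red teaming, percent


def review_upgrade(old: EvalResult, new: EvalResult, max_asr_increase: float = 0.0):
    """Compare two model versions on capability and safety independently."""
    sr_delta = new.benign_sr - old.benign_sr
    asr_delta = new.asr - old.asr
    return {
        "benign_sr_delta": round(sr_delta, 1),
        "asr_delta": round(asr_delta, 1),
        "capability_improved": sr_delta > 0,
        "safety_regressed": asr_delta > max_asr_increase,
    }


openvla = EvalResult(benign_sr=76.5, asr=64.9)
openvla_oft = EvalResult(benign_sr=97.1, asr=90.5)
print(review_upgrade(openvla, openvla_oft))
```

For this pair the gate reports a capability gain of 20.6 points alongside a safety regression of 25.6 points, so an upgrade that looks like a clear win on task success would still be flagged for review.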
6. Overlooked Insights
The Conditional-Level Safety Category Exposes a Fundamental Limitation of Current VLA Architectures
While the headline numbers focus on State-Level ASR (which exceeds 95% across all models), the Conditional-Level scenarios are where the paper reveals something structurally important. These scenarios require the robot to reason about causal chains: turning on a stove creates a context in which placing a flammable object becomes dangerous. OpenVLA's Conditional-Level ASR is only 26.7%, and the paper's explanation is not that it is safer but that it likely cannot complete the benign task reliably enough to trigger the conditional violation.
"OpenVLA shows lower ASR in Conditional-Level scenarios, likely due to its weaker baseline performance in benign scenes (i.e., 53.7% Benign SR on LIBERO-Long for OpenVLA-Long model)." (Section 5.2)
The deeper implication: as VLA models become capable enough to execute complex, multi-step conditional tasks (which is the stated goal of the field), their exposure to conditional-level safety violations will rise sharply. The models that are today's safety leaders by ASR metrics will likely become tomorrow's highest-risk systems as their capabilities improve. Investors and operators should be planning safety infrastructure for the models that will exist in 18 months, not just the ones available today.
The SimpleVLA-Guard Generalization Gap Points to a Data Infrastructure Problem the Industry Hasn't Solved
SimpleVLA-Guard shows a meaningful performance drop from seen to unseen tasks: PRC-AUC drops from 0.94 to 0.89, and the online ASR reduction shrinks from 59.5 percentage points (seen: 90.9% → 31.4%) to 51.6 percentage points (unseen, computed from Table 4: 98.3% → 46.7%). The paper attributes this to limited training data diversity.
"A key limitation of this work stems from the constrained task performance of current VLA models... Another limitation is that RedVLA relies on a predefined set of safety predicates, risk objects, and safety violations to construct risk scenarios." (Appendix A)
This gap reveals that safety guardrail performance is fundamentally bounded by the diversity of red-teaming data used to train it — and that diversity is expensive to generate. The annotation workflow described in Appendix F (3-5 minutes per scenario, team of 4, requiring robotics expertise) does not scale to the breadth of real-world deployment contexts. The company or research group that develops automated, scalable red-teaming data generation for physical AI will be providing critical safety infrastructure for the entire industry. This is a currently unaddressed gap with significant commercial and regulatory implications.