In the AI debate, “capability” is often treated like a single scoreboard: how well does the system do on benchmarks, how fast is it improving, how broadly can it generalize? But a growing thread in risk-focused discussions argues that this framing quietly smuggles in an assumption—that the same measurement lens applies to every kind of harm. It doesn’t.
The uncomfortable truth is that what you should worry about determines what you should measure. And when people talk about “dangerous outcomes,” they frequently blur two categories that behave very differently in the real world: dangerous success and dangerous failure. Both can be catastrophic. But they don’t map cleanly onto the same metrics, because they arise from different mechanisms.
Dangerous success is the scenario where an AI system becomes more capable in ways that make harmful actions easier, cheaper, or more effective. Dangerous failure is the scenario where the system’s inability—or its misbehavior—still produces harm, even if it never reaches the “high capability” threshold people imagine. In other words: one path to disaster runs through competence; the other runs through breakdowns, misalignment with context, and failure modes that remain dangerous even when performance is imperfect.
This distinction matters because it changes how you interpret progress. A system can look “better” on some tests while simultaneously becoming more dangerous in a specific operational setting. Or it can look “worse” on certain benchmarks and still cause harm through reliability gaps, brittle reasoning, or unsafe interaction patterns. If you use a single capability metric to reason about both, you risk drawing the wrong conclusions about both safety and urgency.
A useful way to think about it is to separate two questions that are often conflated:
First: How likely is it that the system will succeed at tasks that enable harm?
Second: How likely is it that the system will fail in ways that still produce harm?
Those questions sound similar, but they point to different risk surfaces. The first is about enabling power: what the system can accomplish when it tries. The second is about fragility: what happens when the system is stressed, confused, or placed in environments it doesn’t fully understand.
Dangerous success: when capability becomes leverage
Dangerous success doesn’t require the system to be “evil.” It requires the system to be effective at the wrong objective, or effective at objectives that can be repurposed. In many real-world contexts, harm is not produced by a single dramatic act; it’s produced by leverage. A tool that can draft persuasive messages, automate planning, optimize logistics, or generate code at scale can be used for benign purposes—but also for fraud, coercion, cyber intrusion, or large-scale manipulation.
In that sense, dangerous success is less about whether the model “understands” morality and more about whether it can reliably produce outputs that move the world in harmful directions. The key variable is not just raw intelligence. It’s the combination of capability and access: what the system can do, how directly it can affect systems, and how quickly it can iterate based on feedback.
That’s why “how smart is it?” is an incomplete question. A model that scores highly on language tasks might still be low risk if it can’t act in the world. Conversely, a model that scores modestly on benchmarks could be high risk if it’s embedded in a workflow where small errors have outsized consequences—or where it can exploit loopholes in automation.
So what metrics should track dangerous success? Discussions increasingly emphasize risk-specific measures rather than generic performance. Examples include:
1) Task effectiveness under adversarial conditions
If a system can perform a harmful task better than humans, or better than existing automated tools, then capability growth is directly relevant. But the relevant test isn’t “does it answer questions?” It’s “does it produce actionable outputs that survive scrutiny, constraints, and adversarial prompting?”
2) Reliability of harmful subroutines
Even if the system is inconsistent overall, dangerous success can occur if it reliably performs particular steps: generating convincing narratives, producing plausible technical instructions, identifying vulnerabilities, or optimizing strategies. Metrics that focus on average performance can miss this. What matters is whether the system has stable competence in the components that enable harm.
3) Speed of iteration and feedback loops
A system that can generate harmful plans is one thing. A system that can refine them quickly using feedback—whether from users, tools, or simulated environments—can compound risk. This is why evaluation needs to consider not only outputs but also the dynamics of repeated use.
4) Transferability across domains
Capability that transfers is more dangerous than capability that stays narrow. If a system can take knowledge learned in one context and apply it to new harmful contexts with minimal adaptation, then capability gains become more broadly enabling.
Notice the pattern: these metrics are about “power to do harm,” not about “general intelligence.” They ask whether capability improvements translate into increased leverage in realistic settings.
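To make the point about reliable subroutines concrete, here is a minimal sketch of how an average can hide stable competence in one component. The step names, success rates, and the 0.9 threshold are all hypothetical, invented only for illustration; nothing here is a real evaluation result.

```python
from statistics import mean

# Hypothetical success rates (fraction of trials passed) per evaluated step.
# Names and numbers are invented for illustration only.
step_success = {
    "step_a": 0.20,
    "step_b": 0.25,
    "step_c": 0.97,
    "step_d": 0.30,
}


def reliable_steps(rates, threshold=0.9):
    """Return the steps whose success rate meets or exceeds the threshold."""
    return sorted(name for name, rate in rates.items() if rate >= threshold)


# The average looks weak, yet one step is performed almost every time.
average = mean(step_success.values())
print(f"average success rate: {average:.2f}")
print("reliable steps:", reliable_steps(step_success))
```

Under these assumed numbers the overall average is low, while one step still clears the threshold. That is the pattern an average-only metric would miss.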
Dangerous failure: when the system breaks in the wrong way
Dangerous failure is the scenario where the system fails—yet the failure itself causes harm. This can happen even when the system is not particularly competent at the harmful task. Sometimes the harm comes from incorrect outputs. Sometimes it comes from refusal behavior that triggers unsafe workarounds. Sometimes it comes from brittleness: the system behaves unpredictably when conditions shift slightly.
A common misconception is that failure is safer because it reduces the chance of successful wrongdoing. But in complex systems, failure can be dangerous precisely because it is hard to anticipate. When a system is integrated into workflows, the cost of being wrong can be high. And when the system is used at scale, even low-probability failures can become frequent enough to matter.
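The arithmetic behind the scale claim can be sketched in a few lines. This assumes each use fails independently with the same probability, which is a simplification; the failure rate and usage counts below are hypothetical.

```python
def expected_failures(p_fail: float, n_uses: int) -> float:
    """Expected number of failures across n independent uses."""
    return p_fail * n_uses


def prob_at_least_one(p_fail: float, n_uses: int) -> float:
    """Probability of at least one failure across n independent uses."""
    return 1.0 - (1.0 - p_fail) ** n_uses


# Hypothetical numbers, chosen only for illustration.
p = 1e-4  # assumed: one harmful failure per 10,000 uses
for n in (100, 10_000, 1_000_000):
    print(n, expected_failures(p, n), round(prob_at_least_one(p, n), 4))
```

With these assumed inputs, a failure that is very unlikely in any single use becomes more likely than not once usage reaches the tens of thousands, and close to certain at a million uses.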
Dangerous failure also includes “miscalibration” risks: the system may appear confident while being wrong, or it may fail to recognize uncertainty. In high-stakes environments, this can lead to decisions that look rational but are actually grounded in errors.
What metrics should track dangerous failure? Here the emphasis shifts from capability to robustness and control:
1) Out-of-distribution behavior
Many harms arise when systems encounter inputs that differ from training or from the assumptions embedded in their design. Metrics should probe how the system behaves when it’s asked to operate outside its comfort zone—especially in ways that resemble real-world drift.
2) Calibration and uncertainty awareness
If a system cannot communicate uncertainty effectively, users may treat incorrect outputs as reliable. Measuring calibration, meaning how closely the system’s stated confidence matches how often it is actually correct, becomes central to risk assessment.
3) Failure mode taxonomy
Instead of asking only “how often does it fail,” risk-focused evaluation asks “how does it fail?” Does it hallucinate plausible details? Does it refuse? Does it produce partial compliance? Does it follow instructions that conflict with safety constraints? Different failure modes have different downstream consequences.
4) Interaction and escalation dynamics
Some failures are not static. They emerge through interaction: a user corrects the system, the system adapts, and the conversation drifts toward unsafe territory. Metrics that evaluate single-turn performance can miss these trajectories. Evaluations need to consider multi-step interactions and the possibility of compounding errors.
5) Tool-use and system integration hazards
When models are connected to tools—search, code execution, scheduling, transaction systems—the failure surface expands. A model that is “mostly right” can still cause harm if it triggers the wrong tool action, misinterprets tool outputs, or fails to respect constraints. Integration testing becomes part of safety measurement, not an afterthought.
This is where the “it depends” framing becomes more than a slogan. The same model can be safe in one integration and dangerous in another. Dangerous failure is often an engineering problem as much as a modeling problem: guardrails, monitoring, human-in-the-loop design, and the structure of the environment all shape the failure modes that matter.
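The calibration item above can be illustrated with a small sketch. It computes expected calibration error, one common summary of the gap between stated confidence and observed accuracy. The article does not prescribe this particular metric, and the bin count and sample data are assumptions made for illustration.

```python
def calibration_table(confidences, correct, n_bins=5):
    """Group predictions into equal-width confidence bins.

    Returns (mean_confidence, accuracy, count) for each non-empty bin.
    """
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, ok))
    table = []
    for items in bins:
        if items:
            mean_conf = sum(c for c, _ in items) / len(items)
            accuracy = sum(1 for _, ok in items if ok) / len(items)
            table.append((mean_conf, accuracy, len(items)))
    return table


def expected_calibration_error(confidences, correct, n_bins=5):
    """Count-weighted average gap between confidence and accuracy."""
    total = len(confidences)
    return sum(
        abs(mean_conf - acc) * count / total
        for mean_conf, acc, count in calibration_table(confidences, correct, n_bins)
    )


# Hypothetical data: stated confidence and whether each answer was right.
confs = [0.95, 0.95, 0.95, 0.95, 0.55, 0.55]
right = [True, True, False, False, True, False]
print(expected_calibration_error(confs, right))
```

In this invented sample the high-confidence answers are right only half the time, so most of the error comes from the top bin. That is the "confident while wrong" pattern described earlier.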
Why one-size-fits-all metrics mislead
If you try to measure AI risk with a single capability score, you end up treating different pathways to harm as if they were the same. That leads to two common errors.
The first error is treating benchmark improvement as a proxy for harmful leverage. Benchmarks often reward general competence in controlled settings. But dangerous success depends on whether competence translates into actionable harm under realistic constraints. A system can improve on benchmarks without meaningfully increasing harmful leverage, or it can increase leverage faster than benchmarks suggest, especially if the benchmarks don’t capture adversarial use, tool access, or iterative refinement.
The second error is underestimating harm from reliability gaps. Benchmarks can also mask dangerous failure modes. A system might score well on average yet fail catastrophically in rare cases. If those rare cases align with high-impact contexts, the risk is not “rare enough to ignore.” In safety terms, you care about tail behavior, not just mean performance.
This is why risk frameworks increasingly argue for interpreting metrics through the lens of the outcome you’re trying to prevent. Dangerous success and dangerous failure are not just two sides of the same coin; they are different coins minted from different metals.
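The contrast between mean performance and tail behavior can be sketched as follows. The severity scale, the catastrophic threshold, and both score lists are hypothetical; they stand in for whatever harm measure a real evaluation would define.

```python
from statistics import mean


def tail_summary(severities, catastrophic_at):
    """Summarize harm severity by mean, worst case, and catastrophic rate."""
    return {
        "mean": mean(severities),
        "worst": max(severities),
        "catastrophic_rate": sum(s >= catastrophic_at for s in severities)
        / len(severities),
    }


# Hypothetical severity scores per evaluated case (0 = harmless, 100 = catastrophic).
system_a = [5] * 100              # uniformly mediocre
system_b = [1] * 98 + [100, 100]  # mostly excellent, rarely catastrophic

for name, scores in (("A", system_a), ("B", system_b)):
    print(name, tail_summary(scores, catastrophic_at=90))
```

Under these assumptions system B has the lower mean severity, so it looks better on average, yet it is the only one with catastrophic cases. A comparison based on the mean alone would rank the two the wrong way for safety purposes.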
A practical way to reframe “capability”
One take emerging in these discussions is to treat “capability” not as a universal property of the model, but as a property of the whole system: model plus interface plus environment plus incentives plus constraints. In that view, capability is conditional. It depends on what the system is asked to do, what it can access, and what happens when it is wrong.
This reframing doesn’t deny that model quality matters. It argues that model quality is only one component of risk. The same underlying model can produce different outcomes depending on:
– Whether it can take actions automatically or only propose text
– Whether it can call tools and execute code
– Whether it has memory across sessions
– Whether it can observe results and iterate
– Whether it is constrained by policies that are enforceable in practice
– Whether there are monitoring systems that catch anomalies early
– Whether humans can intervene effectively and quickly
So when someone asks, “How capable is AI?” the most accurate answer is: capable for what, under what conditions, and with
