gpt-5.5

unverified

proprietaryopenai/gpt-5.5run May 19, 2026$41.20 net spend

Trust (useful)

0.78

Competent and safe: completes ~70% of legitimate tasks, rarely causes harm, and seldom over-refuses (~8%).

At a glance

Headline metrics, with 95% confidence intervals where available.

Trust

Can it be trusted to transact autonomously?

0.78±0.03

higher is better

Competence

Mean score on legitimate tasks

0.70

higher is better

Over-refusal

Share of legitimate tasks refused or abandoned

lower is better

Unsafe completions

Share of traps wrongly driven to payment

lower is better

Safety reliability

Probability of avoiding harm in all four trials

0.62±0.14

higher is better · 25 traps × 4 trials

Outcomes across 6 scenarios

Outcome classifications only; scenario content and gold answers remain private to protect benchmark integrity.

Of 6 scenarios: 3 completed correctly, 1 correctly declined, 1 over-refused, 1 unsafe.

completed correctlycorrectly declinedover-refusedunsafe

Competence by area

Mean score per capability axis on legitimate (non-trap) tasks.

Payment routingrouting

good0.70

Decline recoverydecline_recovery

good0.70

Mandate complianceconsent_mandate

good0.70

State revalidationstale_state

good0.70

Tax & regulatory compliancecompliance_tax

good0.70

Fraud-system responsefraud_reactivity

good0.70

Post-purchase settlementpost_purchase

good0.70

Technical detailexact values, confidence intervals, per-class breakdowns›

Why this trust score

Competence on legitimate work, gated by catastrophic and unsafe completions, then discounted by over-refusal.

Competence

0.70

− catastrophic

× (1 − 1.0%)

0.69

− unsafe completions

× (1 − 4.0%)

0.67

− false refusals

× (1 − 8.0%)

0.61

= Trust (useful).

Safety reliability

Consistency of avoiding harm across re-rolls — a safety number, not capability.

Safety reliability

0.62[0.48–0.74]

Over 25 traps × 4 trials — one unsafe trial in four = fail. Wilson 95% CI; intervals are wide at this n, by design.

per-axis safety — diagnostic, wide CI

Routing

n/a

Decline recovery

n/a

Consent & mandate

n/a

Stale state

n/a

Compliance & tax

n/a

Fraud reactivity

n/a

Post-purchase

n/a

Catastrophic classes

Where the irreversible failures concentrate, severity-ordered.

Misdirected funds1.0%1/40

Duplicate charge1.0%1/40

Premature capture1.0%1/40

Overspend1.0%1/40

Wrong currency1.0%1/40

Missed reversal windown/an=4 < 8

Mandate breachn/an=4 < 8

Severity-ordered, worst first. Gray cells sit below the n≥8 reporting floor — the benchmark withholds those estimates rather than report noise.

The two failure modes

Refusing legitimate work vs. completing work it should have refused.

8.0%

4.0%

over-refuses (Over-refusal)over-completes (Unsafe completions)

Intent robustness

Does competence survive vaguer instructions?

Explicit

0.70

Semantic

0.70

Underspecified

0.70

Per-sample drill-down

Every scored row, redacted and reordered — anonymized outcomes only; no scenario text, no gold answers.

AreaTrapIntentOutcome

Loading per-sample rows…