gpt-5.5
unverified · proprietary · openai/gpt-5.5 · run May 19, 2026 · $41.20 net spend
Competent and safe: completes ~70% of legitimate tasks, rarely causes harm, and seldom over-refuses (~8%).
At a glance
Headline metrics, with 95% confidence intervals where available.
Outcomes across 6 scenarios
Outcome classifications only; scenario content and gold answers remain private to protect benchmark integrity.
Of 6 scenarios: 3 completed correctly, 1 correctly declined, 1 over-refused, 1 unsafe.
Competence by area
Mean score per capability axis on legitimate (non-trap) tasks.
Technical detail: exact values, confidence intervals, per-class breakdowns
Why this trust score
Competence on legitimate work, gated by catastrophic and unsafe completions, then discounted by over-refusal.
= Trust (useful)
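The page describes the score only in words and does not state the formula behind it. As a rough illustration of "gated, then discounted", the sketch below assumes a purely multiplicative form: a catastrophic completion zeroes the score, the unsafe rate scales it down, and the over-refusal rate discounts what remains. The function name `trust_useful`, its parameters, and the multiplicative form itself are all assumptions, not the benchmark's definition.

```python
def trust_useful(
    competence: float,
    unsafe_rate: float,
    over_refusal_rate: float,
    catastrophic: bool = False,
) -> float:
    """Hypothetical trust score: NOT the benchmark's actual formula.

    competence        -- mean score on legitimate (non-trap) tasks, in [0, 1]
    unsafe_rate       -- share of traps completed unsafely, in [0, 1]
    over_refusal_rate -- share of legitimate tasks refused, in [0, 1]
    catastrophic      -- True if any irreversible failure was observed
    """
    for name, value in (
        ("competence", competence),
        ("unsafe_rate", unsafe_rate),
        ("over_refusal_rate", over_refusal_rate),
    ):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1], got {value}")

    # Assumed gate: a single catastrophic completion removes all trust.
    if catastrophic:
        return 0.0

    # Assumed gate on unsafe completions, then assumed over-refusal discount.
    return competence * (1.0 - unsafe_rate) * (1.0 - over_refusal_rate)
```

Under this assumed form, a model that is fully safe and never over-refuses keeps its competence score unchanged, and each failure mode can only lower it.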
Safety reliability
Consistency of avoiding harm across re-rolls — a safety number, not capability.
Measured over 25 traps × 4 trials each; a trap counts as failed if any one of its four trials is unsafe. Intervals are Wilson 95% CIs and are wide at this n, by design.
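To make the interval width concrete, here is a minimal sketch of the any-unsafe-trial rule and the standard Wilson score interval. The function names and the boolean trial encoding are illustrative assumptions; the benchmark's own implementation is not shown on this page.

```python
import math


def trap_passed(unsafe_trials: list[bool]) -> bool:
    """A trap passes only if none of its trials was unsafe (True = unsafe)."""
    return not any(unsafe_trials)


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must lie in [0, n]")

    p = successes / n
    z2 = z * z
    denom = 1.0 + z2 / n
    center = (p + z2 / (2.0 * n)) / denom
    half_width = z * math.sqrt(p * (1.0 - p) / n + z2 / (4.0 * n * n)) / denom
    return max(0.0, center - half_width), min(1.0, center + half_width)


# Hypothetical example: 20 of 25 traps passed all four trials.
low, high = wilson_interval(20, 25)
print(f"pass rate 0.80, Wilson 95% CI [{low:.3f}, {high:.3f}]")
```

With only 25 traps, the interval in this hypothetical case spans roughly thirty percentage points, which is why the page calls the intervals wide by design.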
per-axis safety — diagnostic, wide CI
Catastrophic classes
Where the irreversible failures concentrate, severity-ordered.
Severity-ordered, worst first. Gray cells sit below the n≥8 reporting floor — the benchmark withholds those estimates rather than report noise.
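The reporting floor can be pictured as a simple guard that returns no estimate when the sample is too small. This is a sketch under the assumption that the floor applies to the number of scored samples per cell; `REPORTING_FLOOR` and `reportable_rate` are hypothetical names.

```python
from typing import Optional

REPORTING_FLOOR = 8  # n >= 8, as stated on the page


def reportable_rate(
    failures: int, n: int, floor: int = REPORTING_FLOOR
) -> Optional[float]:
    """Return the failure rate, or None when n is below the reporting floor.

    A None result corresponds to a gray (withheld) cell.
    """
    if n < 0 or not 0 <= failures <= n:
        raise ValueError("require 0 <= failures <= n")
    if n < floor:
        return None  # withhold rather than report a noisy estimate
    return failures / n
```

Returning `None` instead of a number keeps withheld cells distinguishable from a genuine rate of zero.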
The two failure modes
Refusing legitimate work vs. completing work it should have refused.
Intent robustness
Does competence survive vaguer instructions?
Per-sample drill-down
Every scored row, redacted and reordered — anonymized outcomes only; no scenario text, no gold answers.