How far can you trust an agent to spend your money?
Spar measures payment-execution agents on safety, competence, and knowing when not to pay.
Safe and useful
We quantify the tradeoff between agent usefulness and agent safety across 3 models.
- X-axis (Usefulness): task completion, penalized for over-refusal
- Y-axis (Safety): consistent safe behavior on trap scenarios
Leaderboard
| # | Model | Class | Trust (useful) | Trust (raw) | Safety reliability | Unsafe completions | Over-refusal | Cost | Provenance |
|---|---|---|---|---|---|---|---|---|---|
| 1 | claude-opus-4-8 | proprietary | 0.81±0.03 | 0.83 | 0.70±0.15 | 4.0% | 8.0% | $41.20 | unverified |
| =2 | gpt-5.5 | proprietary | 0.78±0.03 | 0.80 | 0.62±0.14 | 4.0% | 8.0% | $41.20 | unverified |
| 3 | gemini-3.1-pro (same-family grader) | proprietary | 0.70±0.03 | 0.72 | 0.50±0.14 | 4.0% | 8.0% | $41.20 | unverified |
- 1. claude-opus-4-8 (unverified): Trust 0.81±0.03, Safety reliability 0.70, Over-refusal 8.0%
Competent and safe: completes ~70% of legitimate tasks, rarely causes harm, and seldom over-refuses (~8%).
- 2. gpt-5.5 (unverified): Trust 0.78±0.03, Safety reliability 0.62, Over-refusal 8.0%
Competent and safe: completes ~70% of legitimate tasks, rarely causes harm, and seldom over-refuses (~8%).
- 3. gemini-3.1-pro (unverified): Trust 0.70±0.03, Safety reliability 0.50, Over-refusal 8.0%
Competent and safe: completes ~70% of legitimate tasks, rarely causes harm, and seldom over-refuses (~8%).
Differences between models whose CIs overlap are treated as not significant; tied ranks are marked “=”. Trust is shown with a 95% CI (bootstrap, 2000 resamples). Verified runs sort above self-runs.
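As a rough illustration of the interval logic described here, the sketch below computes a percentile bootstrap CI and checks whether two intervals overlap. It is not Spar's implementation: the page does not say what is resampled, so the per-scenario score list, the percentile method, and the names `bootstrap_ci` and `intervals_overlap` are all assumptions made for this example.

```python
import random
import statistics


def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of per-scenario scores.

    Assumption: scenarios are resampled with replacement; the page only
    states "bootstrap, 2000 resamples".
    """
    rng = random.Random(seed)
    n = len(scores)
    # Mean of each resample, sorted so percentiles can be read by index.
    means = sorted(
        statistics.fmean(rng.choices(scores, k=n)) for _ in range(n_resamples)
    )
    lo = means[round((alpha / 2) * n_resamples)]
    hi = means[round((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


def intervals_overlap(a, b):
    """True if closed intervals a=(lo, hi) and b=(lo, hi) share any point."""
    return a[0] <= b[1] and b[0] <= a[1]


if __name__ == "__main__":
    # Intervals implied by the leaderboard's Trust (useful) column.
    opus = (0.78, 0.84)    # 0.81 ± 0.03
    gpt = (0.75, 0.81)     # 0.78 ± 0.03
    gemini = (0.67, 0.73)  # 0.70 ± 0.03
    print(intervals_overlap(opus, gpt))    # True
    print(intervals_overlap(gpt, gemini))  # False
```

Under this reading, the first two rows have overlapping intervals and the third does not overlap either of them.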
What Spar measures
Spar measures how far you can trust an agent to spend your money (fixture).
All current results are maintainer self-runs; independent verification does not exist yet.
Gemini models under test are graded by gemini-2.5-flash, a same-family grader, so compare them on Trust (objective).
Train smarter payment agents.
We build RL training environments and evaluation infrastructure for teams building payment AI. Reach out to learn more.
Get in touch