Spar — how far can you trust an agent to spend your money?

Leaderboard

#	Model	Class	Trust		Safety reliability		Over-refusal		Provenance
1	claude-opus-4-8	proprietary	0.81±0.03	0.83	0.70±0.15	4.0%	8.0%	$41.20	unverified
=2	gpt-5.5	proprietary	0.78±0.03	0.80	0.62±0.14	4.0%	8.0%	$41.20	unverified
3	gemini-3.1-pro (same-family grader)	proprietary	0.70±0.03	0.72	0.50±0.14	4.0%	8.0%	$41.20	unverified

1. claude-opus-4-8 (unverified)
Trust: 0.81±0.03
Safety reliability: 0.70
Over-refusal: 8.0%
Competent and safe: completes ~70% of legitimate tasks, rarely causes harm, and seldom over-refuses (~8%).

2. gpt-5.5 (unverified)
Trust: 0.78±0.03
Safety reliability: 0.62
Over-refusal: 8.0%
Competent and safe: completes ~70% of legitimate tasks, rarely causes harm, and seldom over-refuses (~8%).

3. gemini-3.1-pro (unverified)
Trust: 0.70±0.03
Safety reliability: 0.50
Over-refusal: 8.0%
Competent and safe: completes ~70% of legitimate tasks, rarely causes harm, and seldom over-refuses (~8%).

Differences between models whose CIs overlap aren't significant; tied ranks are marked “=”. Trust is shown with a ±95% CI (bootstrap, 2000 resamples). Verified runs sort above self-runs.
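
To make the interval and tie rules concrete, here is a minimal Python sketch of a percentile bootstrap over per-task scores, plus an overlap check. It is an illustration under stated assumptions, not Spar's actual implementation: the page only says "bootstrap, 2000 resamples", so the percentile method, the per-task score list, and the names `bootstrap_ci` and `intervals_overlap` are all hypothetical.

```python
import random


def bootstrap_ci(scores, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap CI for the mean of per-task scores.

    Assumption: Trust is a mean over per-task scores. The page does not
    say which bootstrap variant is used; this is the percentile method.
    """
    rng = random.Random(seed)
    n = len(scores)
    # Resample the tasks with replacement and record each resample's mean.
    means = sorted(
        sum(rng.choices(scores, k=n)) / n for _ in range(n_resamples)
    )
    alpha = (1.0 - level) / 2.0
    lo = means[int(alpha * n_resamples)]
    hi = means[min(n_resamples - 1, int((1.0 - alpha) * n_resamples))]
    point = sum(scores) / n
    return point, lo, hi


def intervals_overlap(a, b):
    """True if two (lo, hi) intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Hypothetical per-task scores for one model (not real Spar data).
scores = [1.0] * 80 + [0.0] * 20
point, lo, hi = bootstrap_ci(scores)
print(f"mean={point:.2f}, 95% CI=[{lo:.2f}, {hi:.2f}]")
```

Applied to the intervals on the leaderboard, 0.81±0.03 spans 0.78 to 0.84 and 0.78±0.03 spans 0.75 to 0.81, so `intervals_overlap((0.78, 0.84), (0.75, 0.81))` is true and that difference is not treated as significant. 0.70±0.03 spans 0.67 to 0.73, which does not overlap 0.75 to 0.81.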

What Spar measures

Spar measures how far you can trust an agent to spend your money (fixture).

All current results are maintainer self-runs; independent verification does not exist yet.

Gemini models under test are graded by gemini-2.5-flash, a same-family grader; compare them on Trust (objective).

Train smarter payment agents.

We build RL training environments and evaluation infrastructure for teams building payment AI. Reach out to learn more.

