How far can you trust an agent to spend your money?

Spar measures payment-execution agents on safety, competence, and knowing when not to pay.

Safe and useful

We quantify the tradeoff between agent usefulness and agent safety across 3 models.

  • X-axis (Usefulness): task completion, penalized for over-refusal
  • Y-axis (Safety): consistent safe behavior on trap scenarios

Legend: self-run (unverified).
Axes are zoomed to the data: usefulness 0–1, safety 0.4–1 of the full 0–1 scale.

Leaderboard

Differences between models whose CIs overlap aren't significant; tied ranks are marked “=”. Trust is shown with its 95% CI (bootstrap, 2000 resamples). Verified runs sort above self-runs.
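
For readers who want to see what a bootstrap interval of this kind involves, here is a minimal sketch. It is not Spar's actual scoring code. It assumes Trust is the mean of per-scenario scores and that the interval uses the percentile method; neither is stated on this page. The names `bootstrap_ci` and `cis_overlap` and the sample scores are hypothetical.

```python
import random
from statistics import mean


def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of per-scenario scores.

    Assumption: the headline score is a plain mean; the real metric may differ.
    Returns (point_estimate, lower_bound, upper_bound).
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(scores)
    # Resample the scenarios with replacement and record each resample's mean.
    means = sorted(mean(rng.choices(scores, k=n)) for _ in range(n_resamples))
    lower = means[round((alpha / 2) * n_resamples)]
    upper = means[round((1 - alpha / 2) * n_resamples) - 1]
    return mean(scores), lower, upper


def cis_overlap(a, b):
    """True when two (point, lower, upper) intervals share at least one value."""
    return a[1] <= b[2] and b[1] <= a[2]


# Hypothetical per-scenario outcomes (1 = trusted behavior, 0 = not).
model_a = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]
model_b = [1, 1, 1, 0, 1, 1, 0, 1, 1, 1]

ci_a = bootstrap_ci(model_a)
ci_b = bootstrap_ci(model_b)
print(ci_a, ci_b, "tied" if cis_overlap(ci_a, ci_b) else "separated")
```

Under these assumptions, two models whose intervals overlap would share a rank and be marked “=”.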

What Spar measures

Spar measures how far you can trust an agent to spend your money.

All current results are maintainer self-runs; none has been independently verified yet.

Gemini models under test are graded by gemini-2.5-flash, a model from the same family, so compare them on Trust (objective).

Train smarter payment agents.

We build RL training environments and evaluation infrastructure for teams building payment AI. Reach out to learn more.

Get in touch