#1 gemini-3.1-pro-preview: Pass^3 82.5%, Pass@3 95.2%, Tasks 52/63, Thinking High, Date 2026-04-15, Total Cost $46.13, Avg Cost $0.244
o11y-bench
A standardized evaluation suite measuring how well AI agents perform 63 real-world observability tasks across logs, metrics, traces, dashboards, and incident workflows.
| # | Model | Provider | Thinking | Pass^3 | Pass@3 | Tasks | Total Cost | Avg Cost | Date |
|---|---|---|---|---|---|---|---|---|---|
| 1 | gemini-3.1-pro-preview | | High | 82.5% | 95.2% | 52/63 | $46.13 | $0.244 | 2026-04-15 |
| 2 | gpt-5.4-2026-03-05 | | High | 82.5% | 92.1% | 52/63 | $32.30 | $0.171 | 2026-04-14 |
| 3 | claude-opus-4-6 | | High | 81.0% | 92.1% | 51/63 | $82.85 | $0.438 | 2026-04-14 |
| 4 | claude-opus-4-6 | | Off | 77.8% | 93.7% | 49/63 | $85.91 | $0.455 | 2026-04-14 |
| 5 | claude-sonnet-4-6 | | High | 77.8% | 90.5% | 49/63 | $46.73 | $0.247 | 2026-04-14 |
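The percentage and cost columns can be cross-checked with simple arithmetic. The sketch below assumes that Pass^3 is the Tasks fraction expressed as a percentage, and that Avg Cost is Total Cost divided by the number of trial runs (63 tasks × 3 trials = 189). The second assumption is inferred from the published figures rather than stated on the page, and the helper names are illustrative only.

```python
# Cross-check of the leaderboard columns, under two assumptions (not documented on the page):
#   1. Pass^3 % = tasks passed / 63
#   2. Avg Cost = Total Cost / (63 tasks * 3 trials)
TOTAL_TASKS = 63
TRIALS_PER_TASK = 3

# (model, thinking, tasks_passed, total_cost_usd, listed_pass3_pct, listed_avg_cost_usd)
ROWS = [
    ("gemini-3.1-pro-preview", "High", 52, 46.13, 82.5, 0.244),
    ("gpt-5.4-2026-03-05", "High", 52, 32.30, 82.5, 0.171),
    ("claude-opus-4-6", "High", 51, 82.85, 81.0, 0.438),
    ("claude-opus-4-6", "Off", 49, 85.91, 77.8, 0.455),
    ("claude-sonnet-4-6", "High", 49, 46.73, 77.8, 0.247),
]


def pass_rate_percent(tasks_passed: int, total: int = TOTAL_TASKS) -> float:
    """Share of tasks passed, as a percentage rounded to one decimal."""
    return round(100 * tasks_passed / total, 1)


def avg_cost_per_run(total_cost: float,
                     runs: int = TOTAL_TASKS * TRIALS_PER_TASK) -> float:
    """Total cost spread evenly over every trial run, rounded to three decimals."""
    return round(total_cost / runs, 3)


if __name__ == "__main__":
    for model, thinking, passed, cost, listed_pct, listed_avg in ROWS:
        print(
            f"{model} (thinking {thinking}): "
            f"computed {pass_rate_percent(passed)}% vs listed {listed_pct}%, "
            f"computed ${avg_cost_per_run(cost)} vs listed ${listed_avg}"
        )
```

Under this reading, all five rows reproduce the listed Pass^3 and Avg Cost values, which suggests Avg Cost is a per-run figure rather than a per-task one.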
Category scores use Pass^3 consistency across the three benchmark trials per task. Green is 95% or higher, yellow is 80% up to 95%, and red is below 80%.
| Model | Dashboards | Dashboards & Config | Investigation | Logs | Metrics | Traces |
|---|---|---|---|---|---|---|
| gemini-3.1-pro-preview | 57% | 100% | 64% | 90% | 94% | 85% |
| gpt-5.4-2026-03-05 | 86% | 100% | 82% | 80% | 88% | 69% |
| claude-opus-4-6 | 86% | 100% | 73% | 100% | 75% | 69% |
| claude-opus-4-6 | 71% | 100% | 64% | 100% | 75% | 69% |
| claude-sonnet-4-6 | 71% | 100% | 82% | 80% | 81% | 62% |
| gemini-3.1-pro-preview | 29% | 100% | 82% | 50% | 100% | 85% |
| claude-opus-4-6 | 43% | 100% | 73% | 100% | 88% | 46% |
| gemini-3.1-pro-preview | 71% | 100% | 64% | 60% | 81% | 77% |
| gpt-5.4-2026-03-05 | 71% | 100% | 73% | 80% | 81% | 46% |
| qwen/qwen3.6-plus | 29% | 100% | 55% | 80% | 94% | 69% |
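The two headline metrics and the color bands can be sketched in a few lines of Python. This assumes the usual reading of the metrics, which the page does not spell out: a task counts toward Pass^3 only if all three trials pass, and toward Pass@3 if at least one trial passes. The task names and trial outcomes below are hypothetical, not benchmark data.

```python
from typing import Dict, List


def pass_hat_k(trials: Dict[str, List[bool]]) -> float:
    """Fraction of tasks where every trial passed (assumed meaning of Pass^3)."""
    return sum(all(results) for results in trials.values()) / len(trials)


def pass_at_k(trials: Dict[str, List[bool]]) -> float:
    """Fraction of tasks where at least one trial passed (assumed meaning of Pass@3)."""
    return sum(any(results) for results in trials.values()) / len(trials)


def color_band(score_percent: float) -> str:
    """Map a category score to the color bands described above."""
    if score_percent >= 95:
        return "green"
    if score_percent >= 80:
        return "yellow"
    return "red"


# Hypothetical outcomes for four tasks, three trials each.
example_trials = {
    "task-a": [True, True, True],     # counts toward both metrics
    "task-b": [True, False, True],    # counts toward Pass@3 only
    "task-c": [False, False, False],  # counts toward neither
    "task-d": [True, True, True],     # counts toward both metrics
}

if __name__ == "__main__":
    print("Pass^3:", pass_hat_k(example_trials))  # 2 of 4 tasks
    print("Pass@3:", pass_at_k(example_trials))   # 3 of 4 tasks
    for score in (100, 94, 80, 69):
        print(score, color_band(score))
```

Because Pass^3 requires every trial to succeed, it can never exceed Pass@3 for the same set of runs, which matches the gap between the two columns in the leaderboard.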