o11y-bench

The first observability benchmark for AI agents

A standardized evaluation suite measuring how well AI agents perform 63 real-world observability tasks across logs, metrics, traces, dashboards, and incident workflows.

Top Agents


| Rank | Model | Provider | Thinking | Pass^3 | Pass@3 | Tasks | Date | Total Cost | Avg Cost |
|------|-------|----------|----------|--------|--------|-------|------|------------|----------|
| 1 | gemini-3.1-pro-preview | Google | High | 82.5% | 95.2% | 52/63 | 2026-04-15 | $46.13 | $0.244 |
| 2 | gpt-5.4-2026-03-05 | OpenAI | High | 82.5% | 92.1% | 52/63 | 2026-04-14 | $32.30 | $0.171 |
| 3 | claude-opus-4-6 | Anthropic | High | 81.0% | 92.1% | 51/63 | 2026-04-14 | $82.85 | $0.438 |
| 4 | claude-opus-4-6 | Anthropic | Off | 77.8% | 93.7% | 49/63 | 2026-04-14 | $85.91 | $0.455 |
| 5 | claude-sonnet-4-6 | Anthropic | High | 77.8% | 90.5% | 49/63 | 2026-04-14 | $46.73 | $0.247 |

Top 10 By Category

Category scores report Pass^3 consistency across the three benchmark trials run for each task. Green marks 95% or higher, yellow 80% or higher, and red below 80%.


| Model | Dashboards | Dashboards & Config | Investigation | Logs | Metrics | Traces |
|-------|------------|---------------------|---------------|------|---------|--------|
| gemini-3.1-pro-preview | 57% | 100% | 64% | 90% | 94% | 85% |
| gpt-5.4-2026-03-05 | 86% | 100% | 82% | 80% | 88% | 69% |
| claude-opus-4-6 | 86% | 100% | 73% | 100% | 75% | 69% |
| claude-opus-4-6 | 71% | 100% | 64% | 100% | 75% | 69% |
| claude-sonnet-4-6 | 71% | 100% | 82% | 80% | 81% | 62% |
| gemini-3.1-pro-preview | 29% | 100% | 82% | 50% | 100% | 85% |
| claude-opus-4-6 | 43% | 100% | 73% | 100% | 88% | 46% |
| gemini-3.1-pro-preview | 71% | 100% | 64% | 60% | 81% | 77% |
| gpt-5.4-2026-03-05 | 71% | 100% | 73% | 80% | 81% | 46% |
| qwen/qwen3.6-plus | 29% | 100% | 55% | 80% | 94% | 69% |
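
Neither table spells out the two pass metrics in full, but the published numbers are consistent with the usual reading: Pass^3 counts a task only when all three trials succeed, Pass@3 counts it when at least one trial succeeds, and Avg Cost is total cost divided by the 189 attempts (63 tasks × 3 trials). The sketch below illustrates that reading; it is not the benchmark's actual scoring code, and the TaskResult and summarize names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class TaskResult:
    passes: tuple[bool, bool, bool]    # outcome of each of the three trials for one task
    costs: tuple[float, float, float]  # USD spent on each attempt

def summarize(results: list[TaskResult]) -> dict[str, float]:
    """Aggregate leaderboard-style metrics from per-trial outcomes."""
    n = len(results)                                       # 63 tasks in o11y-bench
    pass_cubed = sum(all(r.passes) for r in results) / n   # task counts only if all 3 trials pass
    pass_at_3 = sum(any(r.passes) for r in results) / n    # task counts if at least 1 trial passes
    total_cost = sum(sum(r.costs) for r in results)
    return {
        "pass^3": pass_cubed,
        "pass@3": pass_at_3,
        "total_cost": total_cost,
        "avg_cost_per_attempt": total_cost / (n * 3),      # 63 tasks × 3 trials = 189 attempts
    }
```

Under this reading the #1 entry lines up: 52/63 ≈ 82.5% Pass^3, 60/63 ≈ 95.2% Pass@3, and $46.13 / 189 ≈ $0.244 per attempt.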

Featured Tasks
