o11y-bench

The first observability benchmark for AI agents

A standardized evaluation suite measuring how well AI agents perform 63 real-world observability tasks across logs, metrics, traces, dashboards, and incident workflows.

Top Agents


| Rank | Model | Provider | Thinking | Pass^3 | Pass@3 | Tasks | Date | Total Cost | Avg Cost |
|------|-------|----------|----------|--------|--------|-------|------|------------|----------|
| 1 | gemini-3.1-pro-preview | Google | High | 82.5% | 95.2% | 52/63 | 2026-04-15 | $46.13 | $0.244 |
| 2 | gpt-5.4-2026-03-05 | OpenAI | High | 82.5% | 92.1% | 52/63 | 2026-04-14 | $32.30 | $0.171 |
| 3 | claude-opus-4-6 | Anthropic | High | 81.0% | 92.1% | 51/63 | 2026-04-14 | $82.85 | $0.438 |
| 4 | claude-opus-4-6 | Anthropic | Off | 77.8% | 93.7% | 49/63 | 2026-04-14 | $85.91 | $0.455 |
| 5 | claude-sonnet-4-6 | Anthropic | High | 77.8% | 90.5% | 49/63 | 2026-04-14 | $46.73 | $0.247 |

Top 10 By Category

Category scores report Pass^3 consistency across the three benchmark trials run for each task. Green marks 95% or higher, yellow 80% or higher, and red below 80%.


| Model | Dashboards | Dashboards & Config | Investigation | Logs | Metrics | Traces |
|-------|------------|---------------------|---------------|------|---------|--------|
| gemini-3.1-pro-preview | 57% | 100% | 64% | 90% | 94% | 85% |
| gpt-5.4-2026-03-05 | 86% | 100% | 82% | 80% | 88% | 69% |
| claude-opus-4-6 | 86% | 100% | 73% | 100% | 75% | 69% |
| claude-opus-4-6 | 71% | 100% | 64% | 100% | 75% | 69% |
| claude-sonnet-4-6 | 71% | 100% | 82% | 80% | 81% | 62% |
| gemini-3.1-pro-preview | 29% | 100% | 82% | 50% | 100% | 85% |
| claude-opus-4-6 | 43% | 100% | 73% | 100% | 88% | 46% |
| gemini-3.1-pro-preview | 71% | 100% | 64% | 60% | 81% | 77% |
| gpt-5.4-2026-03-05 | 71% | 100% | 73% | 80% | 81% | 46% |
| qwen/qwen3.6-plus | 29% | 100% | 55% | 80% | 94% | 69% |
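
Neither table spells out the two pass metrics in full, but the published numbers are consistent with the usual reading: Pass^3 counts a task only when all three trials succeed, Pass@3 counts it when at least one trial succeeds, and Avg Cost is total cost divided by the 189 attempts (63 tasks × 3 trials). The sketch below illustrates that reading; it is not the benchmark's actual scoring code, and the TaskResult and summarize names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class TaskResult:
    passes: tuple[bool, bool, bool]    # outcome of each of the three trials for one task
    costs: tuple[float, float, float]  # USD spent on each attempt

def summarize(results: list[TaskResult]) -> dict[str, float]:
    """Aggregate leaderboard-style metrics from per-trial outcomes."""
    n = len(results)                                       # 63 tasks in o11y-bench
    pass_cubed = sum(all(r.passes) for r in results) / n   # task counts only if all 3 trials pass
    pass_at_3 = sum(any(r.passes) for r in results) / n    # task counts if at least 1 trial passes
    total_cost = sum(sum(r.costs) for r in results)
    return {
        "pass^3": pass_cubed,
        "pass@3": pass_at_3,
        "total_cost": total_cost,
        "avg_cost_per_attempt": total_cost / (n * 3),      # 63 tasks × 3 trials = 189 attempts
    }
```

Under this reading the #1 entry lines up: 52/63 ≈ 82.5% Pass^3, 60/63 ≈ 95.2% Pass@3, and $46.13 / 189 ≈ $0.244 per attempt.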

Featured Tasks
