AI Research

Senior SWE-Bench: Open-Source Benchmark Assesses AI Agents as Senior Engineers — Best Model Scores 24%

Thursday, July 2, 2026 • Source: Snorkel AI

What happened

Snorkel AI launched Senior SWE-Bench, an open benchmark suite of 156 real coding tasks sourced from open-source pull requests. Its tasks are intentionally under-specified: instructions are 31% shorter than in previous benchmarks, so agents must infer the missing context from the codebase itself.

Context and impact

Existing benchmarks such as SWE-Bench Verified have been criticized for being too prescriptive. Senior SWE-Bench is designed to evaluate agents by senior-engineer standards, including code taste, meaning adherence to project conventions and not just functional correctness. The results suggest that even top models remain far below the level of a human senior engineer; the best model scores 24.0%.

Details

156 tasks from real pull requests (Python, Go, Elixir, Rust, TypeScript)
Evaluated on correctness plus a taste score (adherence to codebase conventions)
Best model: Claude Opus 4.8, 24.0%
Claude Sonnet 5: 19.4%
Task instructions are 31% shorter than in previous benchmarks
Dataset version: v2026.06
