Research

Senior SWE-Bench: Open-Source Benchmark Assesses AI Agents as Senior Engineers — Best Model Scores 24%

Thursday, July 2, 2026 | Source: Snorkel AI

What happened

Snorkel AI launched Senior SWE-Bench, an open benchmark suite of 156 real coding tasks sourced from open-source pull requests. Unlike previous benchmarks, its tasks are intentionally under-specified: the instructions are 31% shorter, so agents must infer the missing context from the codebase itself.

Context and impact

Existing benchmarks such as SWE-Bench Verified have been criticized for being too prescriptive. Senior SWE-Bench is designed to judge agents by the standards applied to a senior engineer, including code taste (adherence to project conventions) and not just functional correctness. The results show that even top models fall far short of a human senior engineer: the best one scores 24.0%.

Details

  • 156 tasks from real pull requests (Python, Go, Elixir, Rust, TypeScript)
  • Evaluated on correctness plus a taste score (adherence to codebase conventions)
  • Best model, Claude Opus 4.8: 24.0%
  • Claude Sonnet 5: 19.4%
  • Task instructions are 31% shorter than those in previous benchmarks
  • Dataset version: v2026.06
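
To put the percentages above in perspective, the short sketch below converts each reported score into a rough task count. It assumes, purely for illustration, that a score can be read as a plain pass rate over all 156 tasks; the brief does not say how correctness and taste are combined, and the helper name `implied_tasks` is hypothetical.

```python
# Figures quoted in the brief above.
TOTAL_TASKS = 156
SCORES = {"Claude Opus 4.8": 24.0, "Claude Sonnet 5": 19.4}


def implied_tasks(score_percent: float, total: int = TOTAL_TASKS) -> float:
    """Tasks a score would correspond to IF it were a plain pass rate (assumption)."""
    return total * score_percent / 100


for model, score in SCORES.items():
    print(f"{model}: {score}% of {TOTAL_TASKS} tasks is about {implied_tasks(score):.1f} tasks")

# Difference between the two reported scores, in percentage points.
gap = SCORES["Claude Opus 4.8"] - SCORES["Claude Sonnet 5"]
print(f"Gap between the two models: {gap:.1f} percentage points")
```

Under that reading the two scores correspond to roughly 37 and 30 tasks. Neither figure is a whole number, so the published percentages may be averaged over several runs or weighted in some other way; the brief does not specify.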