Senior SWE-Bench: Open-Source Benchmark Assesses AI Agents as Senior Engineers — Best Model Scores 24%
What happened
Snorkel AI launched Senior SWE-Bench, an open benchmark of 156 real coding tasks sourced from open-source pull requests. Unlike previous benchmarks, its tasks are intentionally under-specified: instructions are 31% shorter, so agents must infer the missing context from the codebase itself.
Context and impact
Existing benchmarks such as SWE-Bench Verified have been criticized for being too prescriptive. Senior SWE-Bench is designed to hold agents to senior-engineer standards, which include code taste (adherence to project conventions) as well as functional correctness. The results suggest that even top models fall well short of that bar: the best score is 24.0%.
Details
- 156 tasks from real pull requests (Python, Go, Elixir, Rust, TypeScript)
- Evaluated on correctness plus a taste score (adherence to codebase conventions)
- Best model: Claude Opus 4.8 — 24.0%
- Claude Sonnet 5: 19.4%
- Task instructions are 31% shorter than those in previous benchmarks
- Dataset version: v2026.06
Source: Snorkel AI