Tracking how AI agents are learning to solve real software engineering tasks, measured by two complementary benchmarks.
| Threshold | Today | +6 Months | +12 Months | Doubling Time |
|---|---|---|---|---|
A time horizon is the duration of a task — measured by how long a human expert takes to complete it — where the AI agent succeeds at a given reliability level. It does not mean the AI works for that long; it measures the difficulty of the tasks it can handle. For example, a p50 time horizon of 60 minutes means the agent can complete 50% of tasks that take a human expert about one hour.
The large gap between p50 and p80 for some models (e.g. Claude Opus 4.5: ~5.3 h at p50 vs ~42 min at p80) reflects a relatively flat logistic success curve. These models can handle long tasks at a 50% success rate but are much less reliable at higher thresholds: the probability of success drops gradually with task difficulty rather than sharply.
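The link between a flat curve and a wide p50/p80 gap can be made concrete with a small sketch. It assumes that success probability is a logistic function of log2 of the human task duration. The `slope` value is back-derived here from the two approximate figures quoted above and is not a parameter taken from the METR dataset. All function names are illustrative.

```python
import math


def logit(p: float) -> float:
    """Log-odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))


def success_probability(task_minutes: float, p50_minutes: float, slope: float) -> float:
    """Assumed model: logistic in log2(task duration), equal to 0.5 at the p50 horizon."""
    z = slope * (math.log2(p50_minutes) - math.log2(task_minutes))
    return 1.0 / (1.0 + math.exp(-z))


def horizon_at(p: float, p50_minutes: float, slope: float) -> float:
    """Task duration (minutes) at which the assumed curve gives success probability p."""
    return p50_minutes * 2.0 ** (-logit(p) / slope)


def slope_from_horizons(p50_minutes: float, p80_minutes: float) -> float:
    """Slope implied by a pair of p50 and p80 horizons under the assumed model."""
    return logit(0.8) / math.log2(p50_minutes / p80_minutes)


# Approximate figures quoted in the text (~5.3 h and ~42 min), used for illustration only.
p50 = 5.3 * 60
p80 = 42.0
slope = slope_from_horizons(p50, p80)

print(f"implied slope: {slope:.3f} per doubling of task length")
print(f"success on a 42-minute task: {success_probability(42.0, p50, slope):.2f}")
print(f"p90 horizon under the same curve: {horizon_at(0.9, p50, slope):.1f} min")
```

Under this model, a smaller slope means each doubling of task length costs less success probability, so the p50 horizon sits far to the right of the p80 horizon. A steeper curve would pull the two values closer together.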
All time horizon values are in minutes of human-expert time, sourced from the TH1.1 YAML published by METR. Confidence intervals (ci_low, ci_high within p50_horizon_length) are included in the dataset but not plotted on the chart for clarity. Model release dates, family classifications, and frontier flags (is_sota) are taken from the same YAML file.
SWE-rebench evaluates AI agents on real GitHub issues. Each task requires reading a codebase, understanding the problem described in the issue, and producing a working patch that passes the repository's test suite. The benchmark uses a rolling window of fresh tasks — issues are continuously harvested from active repositories and retired after a fixed period. This mitigates data contamination: models are unlikely to have trained on issues that did not exist when they were built.
A logistic curve, y = L / (1 + exp(-k(x - x0))), is fitted via gradient descent on frontier data points. The asymptote L is clamped to [50, 100], based on our assumption about the practical ceiling of automated issue resolution on continuously refreshed tasks. The fit uses 5,000 iterations with adaptive learning rates.

Because SWE-rebench uses a rolling window, a model's resolved rate may differ from earlier snapshots. Older tasks are retired and replaced with new ones, so the evaluation set changes over time. A model that scored 63% on a previous snapshot may show 44% on the current snapshot. This is by design: it keeps the benchmark meaningful as models improve, but it also means scores are not directly comparable across snapshots.
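The logistic fit described above can be sketched as follows. This is a minimal illustration, not the dashboard's actual implementation. The adaptive step rule (grow the step after an improvement, halve it otherwise), the starting values, the mean-squared-error loss, and the synthetic data points are all assumptions. Only the curve form, the clamp of L to [50, 100], and the 5,000 iterations come from the text.

```python
import math


def logistic(x, L, k, x0):
    """y = L / (1 + exp(-k (x - x0))), with the exponent clipped for numerical safety."""
    z = max(-60.0, min(60.0, -k * (x - x0)))
    return L / (1.0 + math.exp(z))


def mse(xs, ys, L, k, x0):
    """Mean squared error of the curve against the data points."""
    return sum((logistic(x, L, k, x0) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def gradients(xs, ys, L, k, x0):
    """Analytic gradient of the MSE with respect to L, k and x0."""
    n = len(xs)
    g_L = g_k = g_x0 = 0.0
    for x, y in zip(xs, ys):
        s = logistic(x, 1.0, k, x0)        # plain sigmoid value
        err = L * s - y
        ds = s * (1.0 - s)                 # derivative of the sigmoid
        g_L += 2.0 * err * s / n
        g_k += 2.0 * err * L * ds * (x - x0) / n
        g_x0 += 2.0 * err * L * ds * (-k) / n
    return g_L, g_k, g_x0


def fit_logistic(xs, ys, iterations=5000, l_min=50.0, l_max=100.0):
    """Gradient descent with a simple adaptive step size and a clamped asymptote."""
    L, k, x0 = 75.0, 0.1, sum(xs) / len(xs)   # assumed starting point
    lr = 1e-3
    loss = mse(xs, ys, L, k, x0)
    for _ in range(iterations):
        g_L, g_k, g_x0 = gradients(xs, ys, L, k, x0)
        new_L = min(l_max, max(l_min, L - lr * g_L))   # clamp L to [50, 100]
        new_k = k - lr * g_k
        new_x0 = x0 - lr * g_x0
        new_loss = mse(xs, ys, new_L, new_k, new_x0)
        if new_loss <= loss:
            L, k, x0, loss = new_L, new_k, new_x0, new_loss
            lr *= 1.1                                  # accept the step and speed up
        else:
            lr *= 0.5                                  # reject the step and slow down
    return L, k, x0


# Synthetic points generated from a known curve, NOT leaderboard data.
xs = [0.0, 3.0, 6.0, 9.0, 12.0, 15.0, 18.0, 21.0, 24.0]
ys = [logistic(x, 80.0, 0.3, 10.0) for x in xs]

L_fit, k_fit, x0_fit = fit_logistic(xs, ys)
print(f"L={L_fit:.2f}, k={k_fit:.3f}, x0={x0_fit:.2f}, mse={mse(xs, ys, L_fit, k_fit, x0_fit):.4f}")
```

Because a step is only accepted when it does not increase the loss, the error never gets worse than at the starting point, and the clamp guarantees that the fitted ceiling stays inside the assumed range regardless of what the data suggest.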
All data is sourced from swe-rebench.com. Model names, resolved rates, and pass@5 values are taken directly from the leaderboard. No values are adjusted or normalised.
The open-source flag and family classification are inferred from model names using keyword matching (e.g. "claude" → Anthropic, "gpt" → OpenAI), not sourced from the SWE-rebench leaderboard. This classification may be incorrect for models with ambiguous names.
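A keyword classifier of this kind might look like the sketch below. Only the two mappings quoted above come from the text. The function name, the `"Unknown"` fallback, and the case-insensitive substring match are assumptions made for illustration.

```python
# Keyword-to-family pairs; only these two mappings are stated in the text.
FAMILY_KEYWORDS = [
    ("claude", "Anthropic"),
    ("gpt", "OpenAI"),
]


def classify_family(model_name: str) -> str:
    """Return the first family whose keyword appears in the model name (case-insensitive)."""
    name = model_name.lower()
    for keyword, family in FAMILY_KEYWORDS:
        if keyword in name:
            return family
    return "Unknown"  # hypothetical fallback for names that match no keyword


print(classify_family("Claude Opus 4.5"))
print(classify_family("community-gpt-clone"))  # hypothetical name showing a likely misclassification
```

The second call shows the weakness noted above: any name containing "gpt" is attributed to OpenAI, whether or not OpenAI built the model.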
The data shown reflects a single snapshot of the SWE-rebench rolling evaluation window, captured at the collection timestamp shown in the footer. Because the task set rotates continuously, scores from different snapshots are not directly comparable. A model that scored higher on a previous snapshot may show a lower score on the current one.