METR Task-Completion Time Horizons

This visualisation builds on METR's task-completion time horizon methodology, which measures AI agent capability as the duration of tasks (calibrated by human-expert completion time) that agents can complete at 50% (p50) and 80% (p80) reliability thresholds. On this metric, frontier model p50 time horizons have grown roughly exponentially, doubling about every 6 months since 2019.

Kwa, T., West, B., Becker, J., et al. (2025). "Measuring AI Ability to Complete Long Tasks." arXiv:2503.14499
Methodology

What a "time horizon" measures

A time horizon is the duration of a task — measured by how long a human expert takes to complete it — where the AI agent succeeds at a given reliability level. It does not mean the AI works for that long; it measures the difficulty of the tasks it can handle. For example, a p50 time horizon of 60 minutes means the agent can complete 50% of tasks that take a human expert about one hour.

Task suite

  • The METR benchmark (version TH1.1) includes tasks spanning software engineering, machine learning, cybersecurity, and research-assistance domains.
  • Each task has a calibrated human-expert completion time, measured by having professional engineers and researchers perform the tasks under timed conditions.
  • Data is sourced from METR's published YAML at metr.org/assets/benchmark_results_1_1.yaml. Pre-2023 models (GPT-2, davinci-002, GPT-3.5 Turbo Instruct) were evaluated on the TH1.0 task suite; all 2023+ models use the expanded TH1.1 task suite. Values across versions are not directly comparable. The all-time trendline shown is a "stitched" hybrid of both versions, following METR's own methodology.

Reliability thresholds: p50 and p80

  • p50: the agent succeeds on 50% of tasks of that duration.
  • p80: the agent succeeds on 80% of tasks of that duration (always shorter than p50, since higher reliability requires easier tasks).
  • Both are reported directly by METR from empirical evaluation.

The p50/p80 gap

The large gap between p50 and p80 for some models (e.g. Claude Opus 4.5: ~5.3h p50 vs ~42 min p80) reflects a relatively flat logistic success curve. These models can handle long tasks at a 50% success rate but are much less reliable at higher thresholds: their probability of success falls gradually, rather than sharply, as tasks get longer.
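
To make the link between curve steepness and the p50/p80 gap concrete, here is a minimal sketch. It assumes the success curve has the form P(success | t) = sigmoid(beta * (log2(h50) - log2(t))), where t is the human-expert task duration; this parameterisation, the slope name `beta`, and both function names are illustrative assumptions rather than METR's published code.

```python
import math

def horizon_at(reliability: float, h50_minutes: float, beta: float) -> float:
    """Task length (human-expert minutes) at which the assumed logistic
    curve predicts the given success probability."""
    logit = math.log(reliability / (1.0 - reliability))
    return h50_minutes * 2.0 ** (-logit / beta)

def slope_from_horizons(h50_minutes: float, h80_minutes: float) -> float:
    """Slope beta implied by a pair of p50 and p80 horizons under the
    same assumed curve (logit(0.8) equals ln 4)."""
    return math.log(4.0) / math.log2(h50_minutes / h80_minutes)

# Figures quoted in the text: ~5.3 h at p50 and ~42 min at p80.
beta = slope_from_horizons(5.3 * 60, 42.0)  # a shallow slope, a little under 0.5

# A hypothetical steeper curve with the same p50 would have a much longer p80.
steeper_p80 = horizon_at(0.8, 5.3 * 60, beta=2.0)
```

Under this assumed form, the ratio between the p50 and p80 horizons depends only on `beta`, so a wide gap is a direct sign of a shallow slope.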

Doubling time computation

  • Doubling times are computed via exponential regression (log-linear OLS) on frontier model p50 values.
  • The full-trend (2019–2026) doubling time is about 189 days (~6.2 months), using METR's stitched TH1.0 + TH1.1 trendline. The figure is taken from METR's TH1.1 YAML (published value: 188.657 days).
  • The accelerated 2023+ trend (TH1.1 models only) has a doubling time of about 128 days (~4.2 months), with METR's published confidence interval of 105–155 days. Note: the YAML contains no published "2024+" figure; the 2023+ CI should not be read as a 2024-onward trend without additional analysis.
  • Projections beyond the current date use the full-trend regression and are shown as dashed trendlines.
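
The log-linear regression and the projection step described above can be sketched as follows. The data points are synthetic (a horizon that doubles exactly every 200 days), and the function names and the `DAYS_PER_MONTH` constant are assumptions made for illustration; none of this is taken from METR's code or data.

```python
import math
from datetime import date, timedelta

DAYS_PER_MONTH = 365.25 / 12  # assumed conversion used for the "months" figures

def doubling_time_days(points):
    """Fit log2(horizon) = a + b * days by ordinary least squares and
    return 1 / b, the number of days per doubling."""
    first_day = min(d for d, _ in points)
    xs = [(d - first_day).days for d, _ in points]
    ys = [math.log2(h) for _, h in points]
    n = len(points)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxx / sxy  # reciprocal of the OLS slope

def project(horizon_now, days_ahead, doubling_days):
    """Extrapolate a horizon forward along the fitted exponential."""
    return horizon_now * 2.0 ** (days_ahead / doubling_days)

# Synthetic frontier: 10, 20, 40, 80 minutes at 200-day intervals.
start = date(2023, 1, 1)
synthetic = [(start + timedelta(days=200 * i), 10.0 * 2 ** i) for i in range(4)]

fitted = doubling_time_days(synthetic)      # recovers the 200-day doubling time
in_six_months = project(80.0, 6 * DAYS_PER_MONTH, fitted)
```

With the published 188.657-day figure, dividing by `DAYS_PER_MONTH` gives the ~6.2 months quoted above.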

Data source and units

All time horizon values are in minutes of human-expert time, sourced from the TH1.1 YAML published by METR. Confidence intervals (ci_low, ci_high within p50_horizon_length) are included in the dataset but not plotted on the chart for clarity. Model release dates, family classifications, and frontier flags (is_sota) are taken from the same YAML file.

SWE-rebench Frontier Progress

This visualisation draws on SWE-rebench, an automated benchmark of real-world software engineering tasks collected from thousands of GitHub repositories. Unlike manually curated benchmarks, SWE-rebench continuously harvests fresh tasks and tracks potential data contamination by comparing issue creation dates against model release dates. All models are evaluated under identical conditions using a standardised ReAct agent scaffold.

Badertdinov, I., Golubev, A., Nekrashevich, M., et al. (2025). "SWE-rebench: An Automated Pipeline for Task Collection and Decontaminated Evaluation of Software Engineering Agents." arXiv:2505.20411
Methodology

What SWE-rebench measures

SWE-rebench evaluates AI agents on real GitHub issues. Each task requires reading a codebase, understanding the problem described in the issue, and producing a working patch that passes the repository's test suite. The benchmark uses a rolling window of fresh tasks — issues are continuously harvested from active repositories and retired after a fixed period. This mitigates data contamination: models are unlikely to have trained on issues that did not exist when they were built.
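
The date comparison behind the contamination check can be illustrated with a small sketch. The rule used here (a task is flagged when its issue was created on or before the model's release date) and all names and dates are assumptions for illustration; the benchmark's actual pipeline may apply a different rule.

```python
from datetime import date

def potentially_contaminated(issue_created: date, model_released: date) -> bool:
    """Flag a task whose issue already existed when the model was released."""
    return issue_created <= model_released

def fresh_tasks(tasks, model_released):
    """Keep only tasks created strictly after the model's release date.
    tasks: iterable of (task_id, issue_created) pairs."""
    return [task_id for task_id, created in tasks
            if not potentially_contaminated(created, model_released)]

# Hypothetical tasks and release date.
tasks = [("repo-a#12", date(2025, 1, 10)), ("repo-b#7", date(2025, 4, 2))]
clean = fresh_tasks(tasks, model_released=date(2025, 3, 1))
```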

Reported metric: pass@1 (averaged)

  • The "resolved rate" shown is pass@1 averaged over 5 independent runs under a standardised ReAct agent scaffold. Each run uses a fresh context and identical system prompts.
  • The scaffold provides the model with the repository contents, the issue description, and a fixed set of tools (file read, file write, shell execution). The model must autonomously decide what to do.
  • pass@5 (also shown in tooltips) is the probability of solving the issue in at least one of the 5 runs, and is always ≥ pass@1.
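
The two metrics above can be computed from per-run outcomes as in the sketch below. The input layout (a mapping from task ID to a list of per-run booleans) and the sample data are hypothetical; the leaderboard itself publishes only the aggregated values.

```python
def resolved_rates(results):
    """Return (pass@1 averaged over runs, pass@n) where n is the run count.
    results: dict mapping task_id -> list of booleans, one per run."""
    n_tasks = len(results)
    n_runs = len(next(iter(results.values())))
    # pass@1 for each run is the fraction of tasks solved in that run.
    per_run = [sum(runs[i] for runs in results.values()) / n_tasks
               for i in range(n_runs)]
    pass_at_1 = sum(per_run) / n_runs
    # pass@n counts a task as solved if any of its runs succeeded.
    pass_at_n = sum(any(runs) for runs in results.values()) / n_tasks
    return pass_at_1, pass_at_n

# Hypothetical outcomes for four tasks over five runs.
example = {
    "task-a": [True, True, True, True, True],
    "task-b": [True, False, False, True, False],
    "task-c": [False, False, False, False, False],
    "task-d": [False, False, True, False, False],
}
pass_at_1, pass_at_5 = resolved_rates(example)  # 8 of 20 runs; 3 of 4 tasks
```

Because a task solved in any single run also counts toward pass@5, the second value can never fall below the first.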

Frontier computation

  • Models are split into two categories: closed-source (teal) and open-source (terracotta). Each category maintains its own frontier.
  • The frontier is a step function. Models are processed in chronological order; a new frontier point is set whenever a model achieves a resolved rate strictly greater than the previous best in its category.
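
A minimal sketch of the frontier rule described above: sort by release date, then keep a model only when it strictly beats the best score seen so far in its category. The tuple layout and the sample models are hypothetical.

```python
from datetime import date

def frontier(models):
    """Return the models that set a new best resolved rate in their category.
    models: iterable of (release_date, name, resolved_rate, category)."""
    best = {}    # category -> best resolved rate so far
    points = []
    for released, name, rate, category in sorted(models, key=lambda m: m[0]):
        if rate > best.get(category, float("-inf")):  # strictly greater
            best[category] = rate
            points.append((released, name, rate, category))
    return points

# Hypothetical leaderboard entries.
models = [
    (date(2025, 1, 1), "closed-a", 30.0, "closed"),
    (date(2025, 2, 1), "open-a", 20.0, "open"),
    (date(2025, 3, 1), "closed-b", 30.0, "closed"),  # ties the best, so not a new point
    (date(2025, 4, 1), "open-b", 25.0, "open"),
    (date(2025, 5, 1), "closed-c", 41.0, "closed"),
]
frontier_names = [name for _, name, _, _ in frontier(models)]
```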

Trendline fitting

  • Logistic fit: a three-parameter logistic curve y = L / (1 + exp(-k(x - x0))) is fitted via gradient descent on frontier data points. The asymptote L is clamped to [50, 100], based on our assumption about the practical ceiling of automated issue resolution on continuously refreshed tasks. The fit uses 5,000 iterations with adaptive learning rates.
  • Linear fit: ordinary least-squares regression on frontier points. It is shown as an alternative but is unrealistic in the long term, since it implies growth that continues past 100%.
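
The logistic fit can be sketched in plain Python as below. The clamp on L and the 5,000 iterations follow the description above; the mean-squared-error loss, the starting values, and the step-size rule (grow the rate after an improving step, halve it otherwise) are assumptions standing in for the unspecified "adaptive learning rates".

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def mse(xs, ys, params):
    L, k, x0 = params
    return sum((L * sigmoid(k * (x - x0)) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def gradient(xs, ys, params):
    """Gradient of the mean squared error with respect to (L, k, x0)."""
    L, k, x0 = params
    gL = gk = gx0 = 0.0
    for x, y in zip(xs, ys):
        s = sigmoid(k * (x - x0))
        err = L * s - y
        slope = L * s * (1.0 - s)  # derivative of L*s w.r.t. its argument
        gL += 2.0 * err * s
        gk += 2.0 * err * slope * (x - x0)
        gx0 += 2.0 * err * slope * (-k)
    n = len(xs)
    return gL / n, gk / n, gx0 / n

def fit_logistic(xs, ys, iterations=5000, lr=1e-3, l_min=50.0, l_max=100.0):
    """Projected gradient descent with the asymptote L clamped to [l_min, l_max]."""
    params = (min(max(max(ys), l_min), l_max), 0.1, sum(xs) / len(xs))
    loss = mse(xs, ys, params)
    for _ in range(iterations):
        g = gradient(xs, ys, params)
        L, k, x0 = (p - lr * gi for p, gi in zip(params, g))
        candidate = (min(max(L, l_min), l_max), k, x0)
        candidate_loss = mse(xs, ys, candidate)
        if candidate_loss < loss:   # accept the step and speed up
            params, loss = candidate, candidate_loss
            lr *= 1.1
        else:                       # reject the step and slow down
            lr *= 0.5
    return params, loss

# Synthetic frontier: months since first data point vs resolved rate (%).
xs = [float(m) for m in range(0, 22, 2)]
ys = [80.0 * sigmoid(0.3 * (x - 10.0)) for x in xs]
(fit_L, fit_k, fit_x0), final_loss = fit_logistic(xs, ys)
```

Rejecting any step that does not lower the loss keeps the loss from ever increasing, at the cost of one extra loss evaluation per iteration.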

Rolling-window caveat

Because SWE-rebench uses a rolling window, a model's resolved rate may differ from earlier snapshots. Older tasks are retired and replaced with new ones, so the evaluation set changes over time. A model that scored 63% on a previous snapshot may show 44% on the current snapshot. This is by design — it ensures the benchmark remains meaningful as models improve, but means scores are not directly comparable across different snapshots.

Data source

All data is sourced from swe-rebench.com. Model names, resolved rates, and pass@5 values are taken directly from the leaderboard. No values are adjusted or normalised.

Classification caveat

The open-source flag and family classification are inferred from model names using keyword matching (e.g. "claude" → Anthropic, "gpt" → OpenAI), not sourced from the SWE-rebench leaderboard. This classification may be incorrect for models with ambiguous names.
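
A minimal sketch of this kind of keyword matching, and of how it can go wrong. Only the "claude" and "gpt" mappings come from the text above; the open-source keyword list and the function name are hypothetical.

```python
# Family keywords quoted in the text.
FAMILY_KEYWORDS = {"claude": "Anthropic", "gpt": "OpenAI"}
# Hypothetical keywords for the open-source flag (not from the leaderboard).
OPEN_SOURCE_KEYWORDS = ("qwen", "llama")

def classify(model_name):
    """Return (family, is_open_source) inferred from the model name alone."""
    lowered = model_name.lower()
    family = next((fam for kw, fam in FAMILY_KEYWORDS.items() if kw in lowered),
                  "Unknown")
    is_open_source = any(kw in lowered for kw in OPEN_SOURCE_KEYWORDS)
    return family, is_open_source

claude_result = classify("Claude Sonnet 4")
# An open-weight model whose name matches no open-source keyword is
# mislabelled as closed-source under this hypothetical keyword list.
gpt_oss_result = classify("gpt-oss-120b")
```

The second call shows the failure mode: the family is matched, but the open-source flag depends entirely on which keywords happen to be listed.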

Snapshot caveat

The data shown reflects a single snapshot of the SWE-rebench rolling evaluation window, captured at the collection timestamp shown in the footer. As described under the rolling-window caveat, the task set rotates continuously, so scores from different snapshots are not directly comparable.