METR

GPT-5.1 Evaluation Results
We evaluate whether GPT-5.1 poses significant catastrophic risks via AI self-improvement, rogue replication, or sabotage of AI labs.

Measuring AI Ability to Complete Long Tasks
We propose measuring AI performance in terms of the length of tasks AI agents can complete. We show that this metric has been increasing exponentially at a consistent rate.

Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity
We find that when developers use AI tools, they take 19% longer than without—AI makes them slower.

MALT
A dataset of natural and prompted examples of behaviors that threaten evaluation integrity, like generalized reward hacking or sandbagging

Measuring autonomous AI capabilities — resource collection
An index of our research and guidance on how to measure AI systems’ ability to autonomously complete a wide range of multi-hour tasks

Early work on monitorability evaluations
We show preliminary results on a prototype evaluation that tests monitors’ ability to catch AI agents doing side tasks, and AI agents’ ability to bypass this monitoring

Common Elements of Frontier AI Safety Policies
An analysis of the shared components across twelve published frontier AI safety policies, including capability thresholds, model weight security, and deployment mitigations

Frontier AI Safety Policies
A list of AI companies’ frontier safety policies intended to evaluate and manage severe AI risks

What should companies share about risks from frontier AI models?
We describe areas for risk transparency and specific technical questions that a frontier AI developer could answer.
