A benchmark of 189 machine-learning, cybersecurity, software-engineering, and reasoning tasks, with over 1,500 hours of human baselines, used to evaluate autonomous AI agents. EquiStamp-affiliated researchers contributed to it alongside METR.
Endorsements support EquiStamp.
People
Updated 05/18/26 · By grantmaking.ai
Co-author
Grants Received: no grants recorded