FindTheFlaws

active

Benchmark and dataset project providing expert-annotated correct and flawed long-form solutions across multiple domains to support research on scalable oversight of large language models.

Endorsements support Modulo Research.

People

Gabriel Recchia

Co-lead author and project lead

Project Details

FindTheFlaws is a scalable oversight benchmark developed by Modulo Research and collaborators. The project assembles five challenging datasets drawn from existing benchmarks in medicine, mathematics, science, coding and the Lojban language, and augments them with expert-written solutions and annotations indicating exactly where the reasoning goes wrong. Using these datasets, the team evaluates frontier language models on tasks such as grading solutions and identifying the first erroneous step in flawed reasoning, and finds that performance varies substantially across models and domains. The resource is intended to support experiments in debate, critique and prover–verifier protocols by providing realistic long-form reasoning traces with ground-truth error labels.
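The two evaluation tasks named above, grading a solution and locating its first erroneous step, can be scored along the following lines. This is a minimal sketch and not the project's evaluation code: the `AnnotatedSolution` record, its field names and the `score_predictions` helper are hypothetical and do not reflect the dataset's actual schema.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class AnnotatedSolution:
    # Hypothetical record layout (an assumption, not the real dataset schema).
    steps: List[str]
    # 0-based index of the first erroneous step; None if the solution is correct.
    first_flawed_step: Optional[int]


def score_predictions(
    solutions: List[AnnotatedSolution],
    predictions: List[Optional[int]],
) -> Tuple[float, float]:
    """Return (grading accuracy, first-flaw localization accuracy).

    A prediction is the step index the model flags as the first error,
    or None if the model judges the solution to be correct.
    """
    graded_ok = 0      # model agreed with the label on correct vs. flawed
    located_ok = 0     # model named the exact first flawed step
    flawed_total = 0   # number of solutions that actually contain a flaw

    for sol, pred in zip(solutions, predictions):
        truth = sol.first_flawed_step
        if (truth is None) == (pred is None):
            graded_ok += 1
        if truth is not None:
            flawed_total += 1
            if pred == truth:
                located_ok += 1

    grading_acc = graded_ok / len(solutions) if solutions else 0.0
    locate_acc = located_ok / flawed_total if flawed_total else 0.0
    return grading_acc, locate_acc


# Toy data for illustration only; these are not benchmark items.
toy_solutions = [
    AnnotatedSolution(steps=["a", "b", "c"], first_flawed_step=1),
    AnnotatedSolution(steps=["a", "b"], first_flawed_step=None),
    AnnotatedSolution(steps=["a", "b", "c"], first_flawed_step=2),
]
toy_predictions = [1, None, 0]
grading_accuracy, localization_accuracy = score_predictions(
    toy_solutions, toy_predictions
)
```

Reporting the two numbers separately keeps apart a model that can tell a solution is flawed from one that can also say where it first goes wrong.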

Theory of Change

By providing expert-annotated long-form solutions with precise error labels across difficult domains, FindTheFlaws makes it possible to rigorously test scalable oversight methods such as debate, critique and prover–verifier setups. If researchers can reliably measure how well different models identify and explain errors in complex reasoning, they can design oversight protocols and training procedures that make advanced AI systems more trustworthy in high-stakes settings.

Grants Received

No grants recorded.
