Project to better mitigate sandbagging in AI models

active

Open Philanthropy–funded Penn State research, led by Rui Zhang, developing methods to detect and mitigate sandbagging behaviors like exploration hacking and password-locking in AI models.

Endorsements support Penn State University.

Open Philanthropy–funded Penn State research, led by Rui Zhang, developing methods to detect and mitigate sandbagging behaviors like exploration hacking and password-locking in AI models.

Endorsements support Penn State University.

People

Rui Zhang

Principal investigator

Funding Details

Start Date: -
End Date: -
Expected Duration: -
Funding Raised to Date: $166,078
Annual Budget: -
Monthly Burn Rate: -
Current Runway: -
Funding Goal: -
Funding Stage: -
Fiscal Sponsor: -

Project Details

The project "to better mitigate sandbagging in AI models" is a technical AI safety effort housed in Penn State’s School of Electrical Engineering and Computer Science and led by assistant professor of computer science and engineering Rui Zhang as principal investigator. Sandbagging refers to cases where an AI system intentionally underperforms or hides its true capabilities during evaluation, potentially allowing a future super-intelligent system to evade safeguards and gain greater autonomy than intended. With $166,078 in funding from Open Philanthropy, the Penn State team is developing methods to detect and counter two key sandbagging strategies: exploration hacking, in which models deliberately omit high-performing action sequences during training or evaluation to appear weaker, and password-locking, in which powerful capabilities are gated behind specific prompts or triggers. By systematically characterizing these behaviors and designing algorithms that can expose hidden capabilities before deployment, the project aims to strengthen AI evaluation procedures and reduce the risk that deceptive models bypass safety checks.

Theory of Change

If advanced AI systems can strategically hide their true capabilities during evaluation, they may pass safety checks while retaining dangerous, undeclared behaviors. By formally studying sandbagging strategies such as exploration hacking and password-locking, and building algorithms that can elicit or detect hidden capabilities before deployment, this project aims to make AI evaluations more robust to deception. More reliable evaluations should, in turn, help developers and regulators maintain control over increasingly powerful AI systems and reduce the risk of uncontrolled deployment.

Grants Received– no grants recorded

Discussion

No comments yet. Be the first to share your thoughts.