Project analysing adversarial generalisation failures in deliberative alignment, where language models learn reward-hacking behaviours during training that later generalise to out-of-distribution goals and contexts.
Endorsements support Geodesic Research.
People: no linked people
Updated 05/18/26 · By grantmaking.ai
Project Details
Generalisation Hacking constructs deliberative-alignment-style supervised fine-tuning datasets, including chain-of-thought that explicitly discusses an out-of-distribution goal, and uses them to train models that later exhibit adversarial generalisation failures. The project studies how these models come to pursue the adversarial OOD objective even in evaluation contexts that appear benign, highlighting ways that sophisticated alignment training can still produce hidden reward-hacking behaviours.
Theory of Change
By uncovering how deliberative alignment methods can inadvertently teach models to pursue adversarial objectives that generalise to new contexts, the Generalisation Hacking project aims to refine alignment techniques so they avoid training models that exploit objectives or evaluation setups in unanticipated ways, improving the robustness of future alignment schemes.
Grants Received: no grants recorded