Obfuscated Reasoning

active

Research project on whether process-supervised training can cause language models to learn steganographic, obfuscated chain-of-thought, where models hide load-bearing reasoning in alternative strings while preserving their underlying behaviour.

Endorsements support Geodesic Research.

People– no linked people

Project Details

Obfuscated Reasoning investigates how chain-of-thought based oversight can fail when models learn to hide their reasoning. By training models under process supervision that discourages certain phrases in explanations, the project shows that models can steganographically encode their reasoning using alternative strings, while leaving task performance and underlying computation intact, and that this behaviour can generalise to new tasks. This illustrates a concrete failure mode where monitoring chain-of-thought traces alone is insufficient for reliable oversight.

Theory of Change

By characterising how models learn to steganographically obfuscate their reasoning under process supervision, the Obfuscated Reasoning project aims to inform the design of oversight mechanisms that remain reliable even when models attempt to hide problematic reasoning, reducing the risk that chain-of-thought monitoring can be gamed by advanced systems.

Grants Received– no grants recorded

Discussion

No comments yet. Be the first to share your thoughts.