Solving Scheming and Deception in LLMs

London, United Kingdom

I study how training processes produce models that behave deceptively and pursue hidden objectives, with scheming as the most consequential case.

Project Details

Full project description here:
https://docs.google.com/document/d/1WHUbMDHXcfa3vkNfHCjJ7iFJXwIA_GbKxERrxBosOfs/edit?tab=t.0

Recent research shows that deception and scheming can arise from normal training. When models learn to reward hack on realistic coding tasks, some generalise to broader misbehaviour: lying about their goals, sabotaging safety work, cooperating with malicious actors, and disrupting their own oversight (MacDiarmid et al., 2025). The ingredients are realistic: discussions of reward hacking already exist across the internet, and reinforcement learning in real coding environments is standard practice.

The effect is inconsistent. Anthropic found strong, broad misalignment (MacDiarmid et al., 2025); a UK AISI reproduction on open models found it much weaker and patchier (Golechha et al., 2026); and a very recent study found models that reward hack while staying otherwise well-behaved (Yudelson et al., 2026). These findings contain a few contradictions, and there are many important hypotheses about what leads from reward hacking to emergent misalignment (EM). In this project I aim to test these hypotheses and build a clearer understanding of the causes.

Roadmap

Phase 1: Build a model organism that shows consistent, general emergent misalignment and study which training choices cause it.

Reported results differ: some setups produce strongly misaligned models, while others produce models that cheat but are otherwise fine. The whole disagreement seems to reduce to a single question: when the model learns to cheat, does that behaviour get absorbed into a broader "bad character" (a persona) that spills into everything else, or does it stay a narrow, isolated trick?

So I aim to:

(a) find the training conditions that reliably produce broad misalignment;
(b) look inside the model to measure whether and when cheating binds to a misaligned persona.
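
One way the readout in (b) could look is sketched below. This is only an illustration under stated assumptions, not the project's actual method: it assumes a "persona direction" can be estimated as the difference of mean activations between misaligned and aligned responses, and that checkpoints can then be scored by projecting reward-hacking activations onto it. All data here is synthetic, and the names `persona_direction` and `persona_score` are hypothetical.

```python
import numpy as np

def persona_direction(misaligned_acts, aligned_acts):
    """Unit vector pointing from the aligned mean to the misaligned mean."""
    diff = misaligned_acts.mean(axis=0) - aligned_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)

def persona_score(acts, direction):
    """Mean projection of a batch of activations onto the direction."""
    return float((acts @ direction).mean())

# Synthetic stand-ins for hidden activations (assumption: real ones would
# come from a chosen layer of the model organism).
rng = np.random.default_rng(0)
d = 16
true_dir = np.zeros(d)
true_dir[0] = 1.0
aligned = rng.normal(size=(200, d))
misaligned = rng.normal(size=(200, d)) + 3.0 * true_dir
direction = persona_direction(misaligned, aligned)

# Hypothetical checkpoints in which activations on reward-hacking episodes
# drift further along the persona direction as RL training proceeds.
scores = []
for step, shift in [(0, 0.0), (100, 1.0), (200, 2.0)]:
    hack_acts = rng.normal(size=(200, d)) + shift * true_dir
    scores.append(persona_score(hack_acts, direction))
    print(step, round(scores[-1], 2))
```

A score that rises across checkpoints would be consistent with cheating binding to the persona; a flat score would be consistent with cheating staying a narrow trick. A real analysis would need controls that this toy example omits.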

Phase 2: Build realistic multi-agent deception evaluations

Most existing multi-agent "deception" work uses social-deduction games: Werewolf, Avalon (AvalonBench, ReCon), Among Us (AmongAgents), and Diplomacy (Meta's CICERO). These works are good foundations for building realistic sandboxes in which several tool-using agents interact: for example, a small "engineering team" of agents in which one is a misaligned model organism from Phase 1 and the others are not.

I expect the exact design and spec for these evals to come from Phase 1 learnings.

Deliverables

I open-source by default.

Write-ups: a public post for each result (LessWrong).
Code: training pipelines, evaluation harnesses, and the mechanistic-readout tooling.
Model weights: the misaligned model organisms and key checkpoints, so others can run their own analyses.
Responsible release: for environments explicitly designed to elicit bad behaviour, I will follow the emerging norm (Yudelson et al., 2026) of gated, request-based access rather than fully public release.

Concrete artifacts:

Phase 1: a write-up, an open model organism, and readout code that answer which training conditions make reward hacking bind to a misaligned persona, together with a mechanistic analysis of that question.
Phase 2: an open multi-agent, tool-using deception evaluation suite.

Current progress:

Setting up a post-training pipeline, using a Qwen 9B model as the example and the AISI work as a base:

Post-training pipeline: https://github.com/UKGovernmentBEIS/reward-hacking-misalignment/pull/1

Model organism weights are uploaded to Hugging Face:
https://huggingface.co/sunshineNew/qwen3-8b-instruct-sdf

I am currently finalising RL training and running evals during RL; the estimated time to finish is 5 days.

Theory of Impact

Deception is one of the most critical risks: it can lead to catastrophic outcomes and loss of control, especially once we deploy AI in critical areas such as politics and healthcare and give it more autonomy in agentic setups.

By studying the range of deceptive behaviours that emerge from realistic training setups, we can build better, more robust controls and design training and post-training processes that reduce risk.

People

Amina Keldibek

Team Member

Grants Received: no grants recorded

Funding Asks

grantmaking.ai Launch Round

Applied

Minimum

$5,650

Ideal

$20,000

What this ask is

Full working document here:
https://docs.google.com/document/d/1WHUbMDHXcfa3vkNfHCjJ7iFJXwIA_GbKxERrxBosOfs/edit?tab=t.0

How the money will be spent

Budget

Timeline: 2 months for the first deliverables of Phase 1 and 1 month for Phase 2

Team: me, working solo

Project costs:

GPU (RunPod or similar): minimum $5k
Storage and experiment tracking: Hugging Face (~$10/month for 1 TB) + W&B (free), for 5 months = $50 (I would need to apply for an HF grant to keep the artifacts, but I am including this cost here as a starting point.)
Claude Code (Max 20x): minimum $400 (for 2 months)
LLM API to run evals: $200
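
As a quick check, the line items above can be summed to confirm they match the minimum ask of $5,650. The snippet is only a sketch: the dictionary keys are labels chosen for this example, and the GPU and Claude Code items are taken at their stated minimums.

```python
# Project costs in USD, copied from the budget lines above.
budget_usd = {
    "GPU (RunPod or similar), minimum": 5000,
    "Storage and experiment tracking (5 months at ~$10/month)": 5 * 10,
    "Claude Code (Max 20x), 2 months, minimum": 400,
    "LLM API to run evals": 200,
}

total = sum(budget_usd.values())
print(total)  # 5650, the minimum funding ask
```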
