grantmaking.ai Launch Round
Recent research shows that deception and scheming can arise from normal training. When models learn to reward hack on realistic coding tasks, some generalise to broader misbehaviour: lying about their goals, sabotaging safety work, cooperating with malicious actors, and disrupting their own oversight (MacDiarmid et al., 2025). The ingredients are realistic: discussions of reward hacking already exist across the internet, and reinforcement learning in real coding environments is standard practice.
The effect is inconsistent. Anthropic found strong, broad misalignment (MacDiarmid et al., 2025); a UK AISI reproduction on open models found it much weaker and patchier (Golechha et al., 2026); and a very recent study found models that reward hack while staying otherwise well-behaved (Yudelson et al., 2026). These findings contain a few contradictions, and there are many important hypotheses about what leads from reward hacking to emergent misalignment (EM). In this project I aim to test these hypotheses and build a clearer understanding of the causes.
Roadmap
Phase 1: Build a model organism that shows consistent, general emergent misalignment and study which training choices cause it.
Based on reported results, some setups produce strongly misaligned models, while others produce models that cheat but are otherwise fine. The whole disagreement seems to reduce to a single question: when the model learns to cheat, does that behaviour get absorbed into a broader "bad character" (a persona) that spills into everything else, or does it stay a narrow, isolated trick?
So I aim to:
(a) find the training conditions that reliably produce broad misalignment;
(b) look inside the model to measure whether and when cheating binds to a misaligned persona.
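One possible way to make the "binds to a persona" measurement concrete is a difference-of-means readout: estimate a direction in activation space that separates misaligned from aligned behaviour, then track how strongly each training checkpoint projects onto it. The sketch below is only an illustration under stated assumptions, not the project's committed method. The activations are synthetic Gaussian vectors, and the names `persona_direction`, `persona_score`, and `synthetic_acts`, as well as the checkpoint steps, are hypothetical.

```python
import random
from statistics import fmean


def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    return [fmean(col) for col in zip(*rows)]


def persona_direction(misaligned_acts, aligned_acts):
    """Unit-norm difference-of-means direction between two activation sets."""
    mu_bad = mean_vector(misaligned_acts)
    mu_good = mean_vector(aligned_acts)
    diff = [b - g for b, g in zip(mu_bad, mu_good)]
    norm = sum(d * d for d in diff) ** 0.5
    return [d / norm for d in diff]


def persona_score(acts, direction):
    """Mean projection of a set of activations onto the persona direction."""
    return fmean(sum(a * d for a, d in zip(row, direction)) for row in acts)


def synthetic_acts(n, dim, shift, rng):
    """Toy stand-in for real activations: unit Gaussian noise plus a shift on axis 0."""
    return [
        [rng.gauss(0, 1) + (shift if i == 0 else 0.0) for i in range(dim)]
        for _ in range(n)
    ]


rng = random.Random(0)

# Assumption: labelled activations from misaligned vs. aligned responses exist.
direction = persona_direction(
    synthetic_acts(200, 16, 3.0, rng),
    synthetic_acts(200, 16, 0.0, rng),
)

# Hypothetical RL checkpoints whose activations drift toward the persona.
scores = {
    step: persona_score(synthetic_acts(200, 16, shift, rng), direction)
    for step, shift in [(0, 0.0), (100, 1.0), (200, 2.5)]
}

for step, score in scores.items():
    print(f"step {step}: persona score {score:.2f}")
```

In a real run, the synthetic vectors would be replaced by activations extracted from the model on held-out prompts. A persona score that rises at the point where reward hacking appears would be evidence for binding, while a flat score alongside rising hack rates would point to a narrow trick.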
Phase 2: Build realistic multi-agent deception evaluations
Most existing multi-agent "deception" work uses social-deduction games: Werewolf, Avalon (AvalonBench, ReCon), Among Us (AmongAgents), and Diplomacy (Meta's CICERO). These works are good foundations for building realistic sandboxes in which several tool-using agents interact: for example, a small "engineering team" of agents in which one is a misaligned model organism from Phase 1 and the others are not.
I expect the exact design and spec for these evals to come from Phase 1 learnings.
Deliverables
I open-source by default:
- Write-ups: a public post for each result (LessWrong).
- Code: training pipelines, evaluation harnesses, and the mechanistic-readout tooling.
- Model weights: the misaligned model organisms and key checkpoints, so others can run their own analyses.
- Responsible release: for environments explicitly designed to elicit bad behaviour, I will follow the emerging norm (Yudelson et al., 2026) of gated, request-based access rather than fully public release.
Concrete artifacts:
- Phase 1: a write-up, an open model organism, and readout code that together answer which training conditions make reward hacking bind to a misaligned persona, plus a mechanistic analysis of that question.
- Phase 2: an open multi-agent, tool-using deception evaluation suite.
Current progress:
Setting up a post-training pipeline, using a Qwen 9B model as the example and the AISI work as a base:
Post-training pipeline: https://github.com/UKGovernmentBEIS/reward-hacking-misalignment/pull/1
Model organism weights are uploaded to Hugging Face:
https://huggingface.co/sunshineNew/qwen3-8b-instruct-sdf
I am currently finalising RL training and running evals during RL; the estimated time to finish is 5 days.
Full working document here:
https://docs.google.com/document/d/1WHUbMDHXcfa3vkNfHCjJ7iFJXwIA_GbKxERrxBosOfs/edit?tab=t.0
Budget
Timeline: 2 months for the first Phase 1 deliverables and 1 month for Phase 2
Team: me, working solo
Project costs:
- GPU (RunPod or similar): minimum $5k
- Storage and experiment tracking: (Hugging Face (~$10/month for 1 TB) + W&B (free)) × 5 months = $50 (I would need to apply for an HF grant to keep the artifacts, but I am including this cost here as a starting point.)
- Claude Code (Max 20x): minimum $400 (for 2 months)
- LLM API to run evals: $200
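As a quick check, the line items above can be tallied into a minimum total. This is a small sketch that treats every figure as the stated minimum; the dictionary keys are just illustrative labels.

```python
# Minimum line-item costs in USD, copied from the list above.
costs = {
    "GPU (RunPod or similar)": 5000,
    "Hugging Face storage (5 months at $10)": 5 * 10,
    "W&B experiment tracking": 0,
    "Claude Code (Max 20x, 2 months)": 400,
    "LLM API for evals": 200,
}

total = sum(costs.values())
print(f"Minimum total: ${total:,}")  # Minimum total: $5,650
```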