Concept-anchored representation engineering for alignment

active

New techniques to impose minimal structure on LLM internals for monitoring, intervention, and unlearning.

Project Details

Project summary

Develop techniques to impose just enough structure on LLM latent spaces during training to enable precise monitoring and post-hoc intervention.

Models naturally develop structured latent representations (e.g. the residual stream in transformers), but we currently have little control over how concepts organize. Prior attempts have focused on comprehensive supervision or post-hoc discovery, rather than minimal anchoring during training. My hypothesis is that if we give the models a gentle nudge right from the start of pre-training, they could become much more interpretable for the specific concepts and behaviors that we care about.

My preliminary experiments with autoencoders show that anchoring just one concept causes related concepts to organize predictably nearby, i.e. anchored concepts act as attractors during training. This happens even with only weak supervision, which suggests that it could be made to work for models trained on very large corpora that are hard to label (e.g. frontier models). Knowing where key concepts are located could enable surgical removal of dangerous capabilities without broad performance degradation, and could be highly resistant to capability recovery.
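To make the idea of anchoring concrete, here is a minimal sketch of what an anchoring term could look like. It is an illustration under stated assumptions, not the loss used in the experiments: the function name `anchor_penalty`, the squared-distance form, the `weight` value, and the convention that unlabelled samples carry `None` are all hypothetical.

```python
def anchor_penalty(latents, labels, anchor, weight=0.1):
    """Label-weighted mean squared distance between latents and a fixed anchor.

    latents: list of latent vectors (lists of floats), e.g. bottleneck activations
    labels:  per-sample concept strength in [0, 1], or None when unlabelled
    anchor:  hypothetical fixed point in latent space reserved for the concept
    weight:  hypothetical regularization strength, kept small so the main loss dominates
    """
    total = 0.0
    label_mass = 0.0
    for z, y in zip(latents, labels):
        if y is None:  # unlabelled samples contribute nothing
            continue
        sq_dist = sum((zi - ai) ** 2 for zi, ai in zip(z, anchor))
        total += y * sq_dist  # weak or noisy labels simply pull less strongly
        label_mass += y
    if label_mass == 0.0:
        return 0.0
    return weight * total / label_mass


# Example: only the first sample is labelled as the anchored concept.
batch = [[0.0, 0.0], [5.0, 5.0], [0.2, 0.9]]
labels = [1.0, None, 0.0]
penalty = anchor_penalty(batch, labels, anchor=[1.0, 0.0])
```

In a training loop, a term like this would be added to the usual reconstruction loss, so that only samples believed to express the concept are pulled toward the anchor while everything else is left free to organize itself.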

I see a path forward to apply this to LLMs, and I would like to pursue it.

What are this project's goals? How will you achieve them?

Delivery goal: Develop techniques that provide alignment researchers with:

Known concept locations before deployment (no search required)
Surgical intervention capabilities (remove specific harms without general performance loss)
Training-time safety measures rather than purely post-deployment fixes, including the potential to monitor these capabilities during training.

Personal goal: Complete my transition from software consultancy to AI safety research.

Milestones

Proof-of-concept (6 weeks): structuring of latent spaces in bottleneck autoencoders (COMPLETED)
Practical intervention (4-5 weeks): Demonstrate concept suppression and precise unlearning, using autoencoder from initial experiments (Minimal funding scenario)
Transformer transfer (8-10 weeks): Apply techniques to attention-based architectures using small transformers
Language model application (6-9 weeks): Structure abstract concepts (e.g. deception, harmfulness) in transformer language models, including development of the required training data pipeline

For more details, see Concept-anchored representation engineering for alignment, which is the introductory post in my LessWrong sequence on this research.

Technical approach for transfer to transformer LMs

Constraints: I don't plan to make significant architectural changes; instead, I'll apply minimal regularization to encourage interpretable structure while preserving model capacity.
Data: I’ll use a mixture of unlabelled and labelled data, with labels coming from pre-labelled data (e.g. from Hugging Face), and automated labeling techniques (e.g. sentiment analysis). That is, not all the data will be labelled, and the labels will be inaccurate by design. I expect that this will provide enough of a signal, following the successful use of noisy labels in my proof-of-concept.
Success metrics: I expect to be able to suppress and remove target concepts from the models. For the transformers, I'll measure success as the increase in surprisal for text relating to the suppressed concepts.

Comparison to existing methods

Unlike imposing strong (or many) constraints, this approach should have minimal impact on performance.
Unlike concept suppression during training, which results in more entanglement, this approach should cause the model to learn undesirable concepts in a clean and separable way.
Unlike post-hoc mech interp and representation engineering, this approach could allow for precise and robust unlearning, which could make even open weights models safer.
Unlike concept bottleneck models (that impose architectural constraints), this method would apply minimal regularization to the transformer's residual stream and/or attention mechanisms. This preserves the model's full representational capacity while encouraging interpretable structure to emerge in predictable locations.
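The surprisal-based success metric from the technical approach can be sketched as follows. This is a minimal illustration, assuming the per-token probabilities that a model assigns to the actual tokens of a text are already available; the function names and the example numbers are hypothetical, and comparing against a control text is one possible way to check that an intervention is targeted.

```python
import math


def mean_surprisal(token_probs):
    """Average surprisal in bits, i.e. the mean of -log2(p) over the tokens."""
    if not token_probs:
        raise ValueError("token_probs must not be empty")
    return sum(-math.log2(p) for p in token_probs) / len(token_probs)


def surprisal_increase(probs_before, probs_after):
    """Change in mean surprisal caused by an intervention (positive = more surprised)."""
    return mean_surprisal(probs_after) - mean_surprisal(probs_before)


# Hypothetical probabilities for the same tokens before and after suppression.
concept_before = [0.5, 0.5]    # text about the suppressed concept
concept_after = [0.25, 0.25]
control_before = [0.5, 0.25]   # unrelated control text
control_after = [0.5, 0.25]

concept_delta = surprisal_increase(concept_before, concept_after)  # 1.0 bit
control_delta = surprisal_increase(control_before, control_after)  # 0.0 bits
```

A large increase on concept-related text together with a near-zero change on control text would be the pattern consistent with surgical removal rather than broad degradation.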

Literature gap

Most work on disentangled representations in NLP is domain-specific or imposes much stronger constraints. Anthropic’s interpretability research, for example, uses post-hoc sparse autoencoders, not minimal structural guidance during pre-training. My approach prescribes where specific concepts develop, rather than just analyzing emergent structure.

Deliverables

One or more LessWrong posts per milestone
All experiments (notebooks) published on GitHub

Potential impact

Technical AI alignment research is increasingly focused on understanding and controlling the inner workings of large language models. This project contributes by exploring novel methods for shaping latent representations during training, complementing existing approaches that focus on post-hoc analysis or extensive supervision. If successful, this work could make other alignment efforts considerably easier, potentially reducing x-risk from misaligned models.

I have personal contacts who are safety researchers, including at a frontier lab, and I am a member of several online alignment forums (BlueDot alumni, 80,000 Hours alumni, LessWrong, and various specialised Slack and Discord groups). I will promote this work through all of those channels to increase its impact.

How will this funding be used?

Minimal scenario: 5 weeks to deliver Milestone 2 only: 77% salary, 14% overheads, 9% buffer.
If fully funded: 24 weeks to deliver Milestones 2-4: 85% salary, 6% overheads, 9% buffer.

I’ve based the timeline on my actual research pace so far (6 weeks for the initial milestone, including infrastructure setup). The salary is based on what I could get as a full-time Senior AI Engineer today (including income tax and superannuation). This is effectively a discounted rate, because I haven’t adjusted for the loss of other Australian workplace entitlements such as leave (usually contractors charge more because of this).

Full details in this spreadsheet.

Who is on your team? What's your track record on similar projects?

I will conduct this research independently, but I am part of a local community of practice (Melbourne AI Safety Hub). We frequently meet in person to discuss our work, including this research. Some other members of the Hub are also working on DevInterp, so I have people to validate my ideas with.

Referees:

Dan MacKinlay, PhD, Research Scientist in ML at CSIRO’s Data61, researching multi-agent scenarios.
Alexander Saeri, PhD, AI governance researcher at MIT FutureTech & The University of Queensland. Co-founder of Ready Research, an EA research organisation. Invited speaker at EAGxAustralia.
Dane Sherburn, OpenAI

Referee contact details available on request.

Research capability

Published technical articles on LessWrong:

Selective regularization for alignment-focused representation engineering (Milestone 1). Demonstrates novel adaptation of regularization techniques for alignment applications.
Side quests in curriculum learning and regularization (Milestone 1). Documentation of systematic exploration of training methodologies, including negative results and methodology improvements that informed the successful selective regularization approach.
Detecting out of distribution text with surprisal and entropy (my BlueDot project). I reproduced a jailbreak filter paper and developed a novel token-level metric and visualization.

Technical execution

20+ years of professional software development across many high-tech domains, from auto engineering, through power generation and specialized sensors, to continent-scale geospatial data storage and processing. My experience includes end-to-end delivery of ML/AI applications and data platforms. I have developed strong spatial intuition through my work on graphics (I am a contributor to Blender), game development, and geospatial data processing (including high-dimensional data).

For a view into my understanding of the transformer architecture, see my re-implementation of nanoGPT, which includes detailed notes.

Delivery

Consistently highly motivated: completed a five-year independent video game development project. Current research pace: 6 weeks for initial milestone including infrastructure setup, suggesting realistic timeline estimation for remaining work.

See my LinkedIn profile for written recommendations.

What are the most likely causes and outcomes if this project fails?

This is a research project, but one with few uncontrolled variables. The most likely cause for failure would be unforeseen technical blockers: it seems tractable to me now, but perhaps I don’t appreciate the difficulty of what I’m attempting. In that case, I would still publish the negative results.

How much money have you raised in the last 12 months, and from where?

None. I have recently submitted a similar application to LTFF, but it seems they are entering a grantmaking pause. I have also applied to Open Philanthropy's Career development and transition fund. I intend to apply for other sources of funding.

Another application is being made specifically for the use of a coworking space; currently, that is included in this proposal.

If any of those are successful, I will update this application.

People

Sandy Fraser

creator

Funding Details

Funding Goal: $72,300

Grants Received: no grants recorded

Discussion

Sandy Fraser, 1y

Hi @NeelNanda, I think this project might interest you given your work on mech interp.

I'm exploring concept-anchored representation engineering: using minimal regularization during training to guide where specific concepts develop in latent space, rather than discovering them post-hoc. My experiments with autoencoders show that anchoring just one concept causes related concepts to organize predictably nearby, even with noisy labeling. If we can predictably place safety-relevant concepts (deception, harmfulness) in known directions during pretraining, it might enable more surgical interventions and less dependence on post-hoc search.

My findings so far:

Single concept anchoring influences broader latent organization
Works with weak supervision (perfect labels not needed)
Stochastic per-sample regularization is sufficient
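One possible reading of "stochastic per-sample regularization" with weak labels is that each sample receives the regularizer only with a probability given by its noisy label. The sketch below illustrates that reading; it is an assumption rather than a description of the experiments, and the name `sample_regularized_indices` and the `label_probs` input are hypothetical.

```python
import random


def sample_regularized_indices(label_probs, rng):
    """Choose which samples in a batch receive the regularizer on this step.

    label_probs: per-sample probability in [0, 1] that the concept applies,
                 e.g. from an automated and imperfect labeller (hypothetical input)
    rng:         a random.Random instance, passed in so runs are reproducible
    """
    return [i for i, p in enumerate(label_probs) if rng.random() < p]


rng = random.Random(0)

# Certain labels behave deterministically: p=0 is never chosen, p=1 always is.
chosen = sample_regularized_indices([0.0, 1.0, 0.0, 1.0], rng)  # [1, 3]

# Uncertain labels are chosen in proportion to their probability over many steps.
hits = sum(len(sample_regularized_indices([0.3], rng)) for _ in range(10_000))
fraction = hits / 10_000
```

Under this scheme an individual mislabelled sample is occasionally regularized by mistake, but averaged over training the pull toward the anchor tracks the label probabilities, which is one way weak supervision could still provide a usable signal.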

Next steps include applying this to small transformers, then language models with automated concept labeling. The goal is surgical capability removal that's resistant to recovery, since the concepts would be cleanly separated by design.

Does this seem like a promising direction? If so, I'd love to continue the work, and I'd be curious about your thoughts on extending this to the residual stream.

Jesse Hoogland, 4d

Note: I didn't fund this but I recommended this project to JueYan Zhang.

Points in favor

We need more research on interpretability-guided training! There's been a growing interest in "interpretability-guided training": see Goodfire's "intentional design", Neel Nanda, gradient routing and SGTM, and our own work at Timaeus. I think this is an area that is both extremely important and extremely neglected. I'm concerned that a lot of harmful structure in models is hard to remove once it's been trained in. In such cases, prevention is a much better approach than post-hoc treatment.
Well-scoped. The work on gradient routing especially demonstrates that, in this line of work, it is tractable for small, independent teams to make meaningful progress. The technique being proposed seems straightforward and practical. You'd think people have already tried regularizing against internal objectives throughout training, but actually, there really hasn't been much work on this. So someone should study it! In addition, the work plan sketched out here seems mostly reasonable (with the only exception being step 4, which is probably still a significant underestimate).
Sandy can handle it. From personal experience (COI: I previously employed Sandy as a research engineer at Timaeus), I know that Sandy is conscientious and great at visualizing and presenting information. He can deliver, as his initial milestone demonstrates.

Main reservations

Research outreach. My main worry is that this research will fall on deaf ears. The first milestone got very little attention. This is understandable given that Australia's AI safety community, though quite large in relative terms, is also still far from the central AI safety nodes in the Bay Area and London. Sandy is aware of this risk and has proposed some mitigations under "Potential Impact", but it remains a risk.
Transfer to real-world settings. I'm typically a fan of working on small toy systems before scaling up to larger settings. In this case, I think there's a risk of jumping to (small) language models too late, and subsequently a risk of jumping to large language models prematurely. I'd encourage Sandy to be willing to spend relatively more time on milestone (3), even at the expense of dropping milestone (4). I'm a bit wary of jumping straight to concepts as high-level and abstract as "deception." What simpler interventions can you demonstrate in small language models?

Austin Chen, 4d

@jesse_hoogland oh this is awesome, thank you for sharing this writeup!

(Do you know if Jueyan did end up funding this? Also curious, did you first discover on Manifund or someplace else?)

Sandy Fraser, 2d

Thank you, @jesse_hoogland ! This means a lot, and I appreciate you recommending the project to @JueYan .

One clarification for anyone reading: this proposal is about a year old, and the milestones have since been renumbered. What I called milestones 1 and 2 here are now bundled as M1, and they're done: published as Sparse Concept Anchoring for Interpretable and Controllable Neural Representations at the GRaM workshop (ICLR 2026). The milestone 3 you point to (small transformers) is what I now call M2, the subject of the new fundraiser below. And the old milestone 4 (language models with real safety targets) has since grown into two milestones: M3 (a small language model trained from scratch) and M4 (retrofitting the method onto an existing open-weights model). Sorry for the confusion.

I've posted a new fundraiser for the next step, M2, here: Concept control in transformers with Sparse Concept Anchoring.

Your reservations are very reasonable.

Outreach. This is my main worry too. Planned mitigations: 1. posts or a paper for legibility (deliverable 2.4 in the proposal), 2. warm introductions and endorsements, 3. a MATS application for the autumn cohort (in progress), 4. framing deliverables for adoption by frontier-lab safety teams.
Transfer. Agreed on focusing on small transformers. The architecture in my new proposal is technically a language model (small transformer with an lm_head), but it stays in a synthetic domain rather than jumping to natural language, so that a negative result is interpretable. I don't want to confound "the method doesn't work" with "I set up an LLM wrong".
The roadmap then progresses to LLMs: M3 starts with tractable concepts (sentiment, formality, refusal) in a small natural language model before it attempts a safety-relevant target, so I'm not jumping straight to something as abstract as deception.
So M2 in the new proposal is a toy transformer in a synthetic domain; scaling to natural language is later work. Is that the direction you had in mind, or were you picturing getting to small natural-language models sooner?