Unelicitable Backdoors in Language Models via Cryptographic Transformer Circuits

active

Research project culminating in a NeurIPS 2024 paper that constructs cryptographic backdoors in transformer language models. The backdoors remain unelicitable and very hard to detect or evaluate, even with full white-box access to the model, which challenges standard AI safety evaluation methods.