Doublespeak: In-Context Representation Hijacking

active

Research project introducing Doublespeak, an in-context representation hijacking attack against large language models. Harmful keywords are systematically replaced with benign tokens across multiple in-context examples, so that each benign token’s internal representation acquires the harmful semantics and bypasses standard safety alignment checks.