Research project that introduces Doublespeak, an in-context representation hijacking attack against large language models. Harmful keywords are systematically replaced with benign tokens across multiple in-context examples, so that the benign token’s internal representation acquires the harmful semantics and bypasses standard safety alignment checks.
Endorsements support MentaLeap.
People
Updated 05/18/26 · By grantmaking.ai
Co-author
Grants Received: no grants recorded